Tags: METR/hawk
fix(api): make health checks independent of the database (#544)

The Hawk API's /health ran DB + S3 checks and was wired to both the ALB target-group check and the ECS container check, so a database slowdown returned 503, pulling the task from the load balancer and, on a sustained outage, letting ECS recycle it (the dev-faber 502 incident; a latent crash-loop risk in prod).

Add a shallow, dependency-free /health/live and point both the ECS and ALB checks at it; keep the deep /health (DB + S3 + migrations) as an unprobed monitoring endpoint. A DB outage can no longer fail the health check, so the task is never de-registered or restarted because the database is down.

On ECS+ALB the ALB health check governs task replacement (not just routing), so a DB-gated readiness check can't shed traffic without also restarting the task; hence a shallow check, not a liveness/readiness split.

Verified on dev-faber1.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
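A minimal sketch of the shallow/deep split described above, written framework-free so it stands alone. The function names `live` and `deep_health` and the response bodies are illustrative assumptions, not the Hawk API's actual handlers; only the /health/live vs /health distinction comes from the commit message.

```python
from typing import Callable, Dict, Tuple


def live() -> Tuple[int, dict]:
    """Shallow check (hypothetical handler for /health/live).

    Performs no I/O, so it can only fail if the process itself is broken.
    This is the endpoint the ECS and ALB probes would point at.
    """
    return 200, {"status": "ok"}


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check (hypothetical handler for /health).

    Runs every dependency check and reports 503 if any fails. Intended for
    monitoring only; nothing that can restart the task should probe it.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A raising dependency check counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    label = "ok" if status == 200 else "degraded"
    return status, {"status": label, "checks": results}
```

With this shape, a database timeout degrades only the deep endpoint, while the probed endpoint keeps returning 200.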
fix(infra/datadog): qualify K8s resource keys with API groups (#468)

The DatadogAgent CRD's kubernetesResourcesAnnotationsAsTags and kubernetesResourcesLabelsAsTags use schema.ParseGroupResource() to derive the API group from each key. Unqualified keys (e.g. 'statefulsets') parse as core API group ('' / v1), which is wrong for non-core resources.

The Datadog operator then tries to create the 'datadog-cluster-agent-annotations-and-labels-as-tags' ClusterRole with rules like {APIGroups: [""], Resources: ["statefulsets"], Verbs: [list, watch]}, and K8s RBAC escalation prevention blocks it because the operator only holds those permissions in the proper groups (apps/, batch/, cilium.io/).

Net effect: the ClusterRole is never created, and the cluster agent cannot tag Job or CiliumNetworkPolicy resources with their inspect-ai.metr.org/* annotations and labels. The operator retries every ~17 minutes, producing ~86 denied actions/day per environment in the EKS audit log (visible in Datadog at: service:operator "Dependencies apply error").

Co-authored-by: Rafael Carvalho <rafaelcarvalho@metr.org>
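The parsing behaviour behind this bug can be sketched in a few lines. This is a Python re-implementation of the split-at-first-dot rule that schema.ParseGroupResource() applies, written for illustration only; the `GroupResource` tuple and `parse_group_resource` name are stand-ins, not the Kubernetes or Datadog API.

```python
from typing import NamedTuple


class GroupResource(NamedTuple):
    group: str
    resource: str


def parse_group_resource(key: str) -> GroupResource:
    """Split "<resource>.<group>" at the first dot.

    A key without a dot yields an empty group, which Kubernetes treats as
    the core API group. That is why an unqualified 'statefulsets' ends up
    in a rule with APIGroups: [""].
    """
    resource, _, group = key.partition(".")
    return GroupResource(group=group, resource=resource)


# Unqualified key: lands in the core group ('').
print(parse_group_resource("statefulsets"))
# Qualified keys: land in the group that actually serves the resource.
print(parse_group_resource("statefulsets.apps"))
print(parse_group_resource("ciliumnetworkpolicies.cilium.io"))
```

Qualifying each key as `<resource>.<group>` is what lets the generated ClusterRole rules name a group the operator already holds permissions in.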
Route per-model Anthropic auth via Workload Identity Federation (#459)

Add AnthropicCredentialBroker (middleman/src/middleman/anthropic_wif.py) that resolves an Anthropic credential per profile: either a static API key env var or a WIF lane. WIF lanes mint short-lived sk-ant-oat01-... tokens by exchanging an Okta client-credentials JWT at api.anthropic.com/v1/oauth/token. Refresh follows the WIF SDK contract (advisory exp-120s, mandatory exp-30s) with a single-flight asyncio lock per profile and a shared aiohttp session.

ModelInfo gains anthropic_account; passthrough.py picks a profile via that field and swaps x-api-key for Authorization: Bearer when the credential is a bearer token. Unknown profile = fail-loud (500) so a typo doesn't silently route to the default Anthropic org. Models without anthropic_account keep the env-var API key path; no behavior change for them.

Error mapping:
- CredentialNotConfiguredError -> HTTP 500 (local misconfig: unknown profile name, missing Okta client_secret in SM).
- CredentialExchangeError -> HTTP 502 (upstream IdP / Anthropic exchange failed).

server.py loads profiles at boot (fails fast on bad config) and only invalidates the broker token cache from the periodic key refresh when SM contents actually changed (returned by provider_key_store.reload), so rotating an Okta client_secret takes effect within one cycle without forcing re-mint every 5 minutes.

Pulumi wires MIDDLEMAN_ANTHROPIC_PROFILES from a new middlemanAnthropicProfiles config key (both from_pulumi_config and from_dev_env). Admin schema exposes anthropic_account on POST/PATCH /admin/models/ so the field is settable without direct DB writes.

New smoke test tests/smoke/scenarios/test_middleman_anthropic_wif.py creates a temp WIF-routed model via the admin API, hits /v1/messages, asserts Anthropic answers, then cleans up. Skipped unless HAWK_SMOKE_WIF_PROFILE is set.

Tests: 802 middleman tests pass (33 new broker tests, 2 new passthrough tests). End-to-end validated against a dev stack: broker logs anthropic_wif.exchange.ok, response routed via Authorization: Bearer.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
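A rough sketch of a per-profile, single-flight token cache of the kind described above. It is not the AnthropicCredentialBroker code: `TokenCache`, `Token`, and the `mint` callback are hypothetical names, and the reading of the two skews (start refreshing at exp-120s, stop serving the old token at exp-30s if a refresh fails) is an assumption about what "advisory" and "mandatory" mean.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict

ADVISORY_SKEW_S = 120   # assumed: try to refresh once within 120s of expiry
MANDATORY_SKEW_S = 30   # assumed: never serve a token within 30s of expiry


@dataclass
class Token:
    value: str
    expires_at: float  # epoch seconds


class TokenCache:
    """Hypothetical single-flight cache: one mint in flight per profile."""

    def __init__(self, mint: Callable[[str], Awaitable[Token]],
                 now: Callable[[], float] = time.time) -> None:
        self._mint = mint
        self._now = now
        self._tokens: Dict[str, Token] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    def _fresh(self, token: Token, skew: float) -> bool:
        return self._now() < token.expires_at - skew

    async def get(self, profile: str) -> str:
        token = self._tokens.get(profile)
        if token and self._fresh(token, ADVISORY_SKEW_S):
            return token.value
        lock = self._locks.setdefault(profile, asyncio.Lock())
        async with lock:
            # Re-check: another caller may have refreshed while we waited.
            token = self._tokens.get(profile)
            if token and self._fresh(token, ADVISORY_SKEW_S):
                return token.value
            try:
                new_token = await self._mint(profile)
            except Exception:
                # Refresh failed inside the advisory window: the old token
                # is still usable until the mandatory deadline.
                if token and self._fresh(token, MANDATORY_SKEW_S):
                    return token.value
                raise
            self._tokens[profile] = new_token
            return new_token.value
```

The re-check under the lock is what makes concurrent callers share one exchange instead of each minting their own token.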
Refresh tailscale subnet router on user-data changes (#397)

* fix: refresh tailscale subnet router on user-data changes

  The ASG referenced its launch template as `version="$Latest"` (a constant string from Pulumi's perspective), so changes to user-data produced new LT versions that Pulumi never saw as a diff on the ASG. The configured `instance_refresh` block therefore never fired, and the running EC2 instance kept its first-boot configuration indefinitely; prd has been running LT v6 from 2026-04-13 while the template is now at v11.

  Pin the ASG to `lt.latest_version` so Pulumi sees concrete version diffs and triggers a rolling refresh on each user-data change.

  Also fix a regression in #192 that broke multi-CIDR VPCs: the `head -1` on `vpc-ipv4-cidr-blocks` IMDS output kept only the primary CIDR, dropping return-traffic routing for any secondary CIDRs (10.51/10.52 in prd, 10.111/10.112 in stg). Loop over all CIDRs instead.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* Revert unneeded user-data loop; subnet router SNATs forwarded traffic

  The head -1 OS route in PR #192 is intentional: Tailscale subnet routers SNAT subnet-routed traffic by default, so forwarded packets carry the router primary ENI IP as source. That source IP passes src/dst check on the primary ENI, and AWS handles VPC-internal routing to secondary CIDRs. No explicit per-CIDR OS routes are needed.

  Verified empirically on stg LT v15: head -1 only routes 10.110/16 via the persistent ENI, yet tailscale ping to 10.110.x.x works correctly and the device advertises + has approved all three stg CIDRs.

  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
fix: update inspect_k8s_sandbox to valid commit (#382)

* fix: update inspect_k8s_sandbox to valid commit

  The previous pin (725637fa) was force-pushed away from the upstream repo, causing all runner pods to fail on startup with: "failed to find branch, tag, or commit 725637fa..." Updates to 3419f2a3 (v0.5.0).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* chore: update stale comment for inspect_k8s_sandbox pin

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
fix(www): restore markdown list bullets/numbers in embedded viewers (#330)

Tailwind preflight's `ol, ul, menu { list-style: none }` reset cascades into the embedded Inspect/Scout viewers and strips markers from markdown-rendered lists. Re-apply disc/decimal list styles within `.markdown-content`.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Upgrade inspect-ai to 0.3.207 and inspect-scout to 0.4.26 (#222)

* chore: prepare release release/20260416144609

* feat: update prepare-release.py for ts-mono and npm web auth

  - Update viewer paths for ts-mono submodule structure
  - Add submodule init and pnpm monorepo install steps
  - Use pnpm pack + npm publish to resolve workspace:* deps
  - Strip @tsmono/* internal deps from package.json before publish
  - Add PTY-based npm web auth with webbrowser.open() for security key 2FA
  - Process packages sequentially to avoid overlapping auth prompts
  - Add --otp CLI flag, --force for git fetch tags
  - Remove private field from package.json before publish

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add explicit types in useInspectApi for new viewer library

  The updated inspect-log-viewer changes the return type of get_logs, causing implicit any errors in the flatMap/map callbacks.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add comments documenting cherry-picked PRs in uv sources

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* uv.lock

* Include sandbox tools

* chore: remove design doc from release branch

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix lint errors in prepare-release.py

  - StrEnum instead of (str, Enum)
  - Remove walrus operators from assert statements
  - Remove unnecessary variable before return
  - Remove stale noqa directive

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: apply prettier formatting to useInspectApi.ts

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* ci: trigger fresh run after LFS fix

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): skip LFS smudge during python-lint install

  Release branches pin inspect-scout to a git commit on the METR fork, which may not have LFS objects available. The dist files aren't needed for linting or type-checking.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): unset checkout credentials before uv sync for LFS access

  actions/checkout sets http.https://github.com/.extraheader with a GITHUB_TOKEN scoped to METR/hawk. This causes git-lfs downloads to fail for other repos (e.g. inspect_scout) because the token isn't authorized for their LFS storage. Unsetting the header lets LFS fall back to unauthenticated access, which works for public repos.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(ci): skip LFS smudge during uv sync and bust stale cache

  The uv cache persists broken LFS checkouts between runs. Using GIT_LFS_SKIP_SMUDGE=1 avoids the issue entirely: the viewer dist files in inspect-scout aren't needed for linting, type-checking, or tests. Added cache-suffix to invalidate the stale cache.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: pin inspect-scout to hotfix branch with de-LFS'd dist files

  The previous pin (75f3837e) had LFS-tracked dist files that weren't available on the METR fork, breaking uv install in CI and Docker builds. The hotfix branch has the same files committed as regular git objects.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update inspect-scout pin to de-LFS'd hotfix (e63c1154)

  The METR fork's dist files are now stored as regular git objects instead of LFS pointers, fixing CI and deploy failures caused by GitHub forks not sharing LFS storage with upstream repos.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: update inspect-scout-viewer to rebuilt npm package

  The previous beta package was built from the wrong hotfix branch. Rebuild from the correct v0.4.26 + PR #367 source.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* revert: remove LFS workarounds from hawk-ci.yml

  No longer needed now that METR/inspect_scout hotfix branch stores dist files as regular git objects instead of LFS.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix ruff formatting in test_sanitization.py

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: remove unnecessary pyright ignore comments in converter.py

  The match is no longer exhaustive with the updated inspect-ai types, so the suppress comments are now flagged as unnecessary.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: update eval log 404 test for new inspect-ai behavior

  inspect-ai now catches FileNotFoundError internally in api_log() and returns a plain 404 Response (no JSON body), so the exception never reaches hawk's exception handler. Update test to match.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* last bump

* chore: upgrade inspect-ai from v0.3.206 to v0.3.207 (with same cherry-picks)

  Rebase METR fork hotfix branch onto 0.3.207 tag, keeping PRs #3376, add automatic npm login when unauthenticated, and fix stale terraform/modules glob path to services/modules.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: spec for retry error grouping in inspect log viewer

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: prepare release release/20260420083319

* chore: update all uv.lock files and document ts-mono PRs

  Run uv-lock-all.sh to catch lock files missed by prepare-release.py. Add ts-mono PR references to pyproject.toml comments.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: smoke test ECR sync and TUI exit behavior

  - Look up inspect_tasks_ecr_url instead of renamed docker_image_repo from Pulumi
  - Add staging fallback in _apply_env_overrides for cached envs missing source_image_repo
  - Handle Pulumi "does not have output property" error as missing key in get_stack_output
  - Exit TUI on full success instead of staying open
  - Remove unused _OUTPUT_KEYS constant

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: shield smoke test job submissions from cancellation

  When the smoke test TUI is quit mid-run, CancelledError could interrupt the API call after the server created the eval-set/scan but before the client registered it for cleanup, leaving orphaned resources. Wrap the submission in asyncio.shield so cancellation doesn't interrupt the HTTP call. If cancelled, await the task to get the job ID and register it for cleanup before re-raising.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
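The shielded-submission fix in the last commit above follows a common asyncio pattern, sketched here under assumptions: `submit_shielded`, the `submit` callback, and the `cleanup_ids` list are hypothetical stand-ins for the smoke test's own helpers, which are not shown in the commit message.

```python
import asyncio
from typing import Awaitable, Callable, List


async def submit_shielded(submit: Callable[[], Awaitable[str]],
                          cleanup_ids: List[str]) -> str:
    """Submit a job so that cancelling the caller cannot orphan it."""
    # Run the submission as its own task so it outlives our cancellation.
    task = asyncio.ensure_future(submit())
    try:
        job_id = await asyncio.shield(task)
    except asyncio.CancelledError:
        # Only the outer await was cancelled; the shielded task keeps
        # running. Wait for the job ID, register it, then re-raise.
        job_id = await task
        cleanup_ids.append(job_id)
        raise
    cleanup_ids.append(job_id)
    return job_id
```

Awaiting the inner task inside the `except` branch is the step that closes the window between the server creating the resource and the client recording it for cleanup.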
Fix weekly release: set GH_TOKEN for gh CLI Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
perf: lazy-load scan_events and optimize S3 parquet reads (#1000)

## Overview

Dramatically improves scan viewer load times by lazy-loading `scan_events` and optimizing S3 parquet reads.

**Issue:** Scan pages taking 2-5+ minutes to load due to large `scan_events` columns being included in the initial Arrow stream.

## Approach and Alternatives

- **Lazy-load `scan_events`**: Excluded from the initial Arrow IPC stream. Fetched on-demand via a new `/fields` endpoint when viewing individual result details.
- **Optimized S3 reads**: Uses PyArrow's native S3FileSystem with `pre_buffer` for efficient range requests instead of reading entire parquet files into memory.

## Testing & Validation

- [x] Covered by automated tests (inspect_scout test suite: 33/33 passing)
- [x] Manual testing: Deployed to dev1, verified scan page loads in ~10s vs 2-5min previously
- [ ] Manual testing instructions:
  1. Open a scan page with large scan results
  2. Verify initial load is fast (scan_events not in initial stream)
  3. Click on individual results to verify scan_events loads via /fields endpoint

## Checklist

- [x] Code follows the project's style guidelines
- [x] Self-review completed
- [x] Documentation updated (if applicable)
- [x] Tests added or updated (if applicable)

## Additional Context

meridianlabs-ai/inspect_scout#367

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>