Tags: comet-ml/opik-mcp
Tags
feat(analytics): BI events for the hosted auth/HTTPS flow (GAP#1/#2/#3)… … (#148) * fix(analytics): sync McpHost/HostLlmFamily Literals with classifiers The privacy-allowlist contract in events.py promises the McpHost and HostLlmFamily Literals stay in sync with the classifiers in mcp_client_info.py, but the classifiers already emit buckets the Literals never declared: McpHost was missing zed/vscode/goose/librechat/5ire/ opencode/codex/gemini-cli, and HostLlmFamily was missing openai (codex) and google (gemini-cli). These values ship in BI today, violating the contract. Add sync tests that drive the classifier functions (covering the hardcoded 'other'/'unknown' fallbacks, not just the pattern tables) to pin the invariant, and add the missing values. Also correct the stale module docstring that pointed at wrappers.py instead of mcp_client_info.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): add EVENT_AUTH_REJECTED + bucket_path/PathBucket Schema foundation for the new auth-rejection BI event (GAP#3). Adds the event-name constant (exported from the analytics package surface), the PathBucket Literal, and bucket_path() which collapses a request path to a low-cardinality {mcp,health,well_known,other} bucket — never emitting the raw path. bucket_path honours a configurable MCP mount path so bucketing stays correct behind a path-prefix proxy, and uses exact-or-subpath matching so siblings like /mcpfoo or /healthz aren't mis-bucketed. bucket_path is exported alongside the other bucket_* helpers. The remaining new Literals (AuthMode, InstallationType, LifecycleSource, ResourceUriScheme, AuthRejectionReason) and the privacy-docstring clarification are introduced in later steps alongside their consumers, test-first. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): add boot_props module (settings-derived BI props) Single source of truth for the new auth/transport properties stamped on server_started / startup_error / auth_rejected, so __main__.main() and the build_app() lifespan (wired in Phases 3-4) emit identical values and cannot drift. Provides installation_type (delegates to error_tracking, 'self-hosted' hyphenated), resource_uri_scheme, oauth_configured(+from_env), dns_rebinding_protection, allowed_hosts_is_default (whitespace-insensitive, default read from the pydantic schema with an import-time type guard), auth_mode_at_boot (settings/outbound default — read alongside oauth_configured for hybrid deploys; per-request mode is resolved separately), and collect_boot_props(). Also defines LIFECYCLE_SENTINEL (env-var name used by Phases 3-4 to prevent double-emit) and the InstallationType/AuthMode/ ResourceUriScheme Literals these values are validated against. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): per-request OAuth identity in _build_event (GAP#2) In hosted HTTP/OAuth mode the per-request identity lives in the auth_context ContextVars but never reached BI — events fell back to the device install_id with no api_key_sha256. _build_event runs synchronously in the caller task, so those ContextVars are live at build time: derive auth_mode (always set, so the stdio/no-auth cohort is visible not dark), token_sha256 (sha256 of the opik_at_ bearer; raw token never stored), and request_workspace (plaintext, matching the existing workspace posture). installation_type is added to the common block so every event can be split cloud/self-hosted. Token extraction mirrors opik_client.resolve_opik_config exactly (same partition(" ") + lstrip + opik_at_ prefix check) so BI's auth_mode/token_sha256 agree with the credential actually forwarded outbound. Merge precedence is now {per_request, properties, common}: common stays authoritative; caller properties win over per-request so server_started's settings-derived auth_mode (Phase 3) beats the contextvar value. Privacy is guarded by direct _build_event tests — the recorder-based privacy tests intercept at track_event and never exercise _build_event. Documents the two declared privacy exceptions (identity hashes, plaintext workspace). Cross-team dependency: token_sha256 is a useful BI join key only once the backend exposes an opik_at_ -> user mapping; until then dashboards treat auth_mode='oauth' rows as identity_pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): boot props + lifecycle_source on lifecycle events server_started now spreads collect_boot_props(settings) (oauth_configured, resource_uri_scheme, dns_rebinding_protection, allowed_hosts_is_default, auth_mode) and lifecycle_source='main'. Spreading into the caller tier means the settings-derived auth_mode wins over _build_event's contextvar fallback, so a pure-OAuth deployment's server_started correctly reports auth_mode='oauth' (not 'none'). server_shutdown carries lifecycle_source; startup_error carries oauth_configured (read from raw env on the config-fail path). installation_type is intentionally NOT duplicated onto these events or into collect_boot_props: _build_event's common block is its single source of truth, stamped on every event (a duplicate would be dead — common wins the merge). The lifecycle sentinel set/get live in boot_props (mark_lifecycle_owned_by_main / lifecycle_owned_by_main) so main() and the Phase 4 build_app() lifespan share one source; main() sets it after setup_sentry, before build_app can run, so the lifespan skips its own emit and boots are never double-counted. conftest clears the sentinel (via the constant) between tests. Adds the LifecycleSource Literal and a sentinel round-trip test that locks the contract Phase 4 depends on. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): emit lifecycle events from build_app() lifespan (GAP#1) The hosted entrypoint runs 'uvicorn opik_mcp.server:build_app --factory', which calls build_app() directly and never runs __main__.main() — so server_started/server_shutdown were 100% dark for the entire hosted fleet, exactly the deployment the OAuth/HTTPS work targets. build_app() now composes an analytics lifespan AROUND FastMCP's session-manager lifespan (captured before overwriting so the session manager still starts). It emits server_started on enter and server_shutdown on exit (reason=transport_error if the body raises), draining the flush off the event loop via run_in_executor. The shutdown finally guards with except BaseException (matching __main__) so an executor CancelledError during loop teardown can't mask the real shutdown reason. The lifespan-seconds anchor is captured at lifespan enter, not build_app() time, to exclude uvicorn startup latency. When main() owns lifecycle (the sentinel is set — and verified inherited by uvicorn --reload spawn workers via the OS environment), the lifespan runs the inner lifespan and emits nothing, so a main()-driven HTTP boot is never double-counted. The environment fingerprint is computed synchronously in build_app() (it shells out on macOS) and only when this process will emit. server_started/server_shutdown payloads come from the shared boot_props.server_started_props / server_shutdown_props helpers, used by BOTH main() and the lifespan so the two boot paths cannot drift. lifecycle_source distinguishes 'main' vs 'lifespan' in BI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(analytics): emit opik_mcp_auth_rejected for auth failures (GAP#3) Auth rejections were invisible to BI: BearerAuthMiddleware returns 401 (missing/malformed bearer) and the SDK transport-security guard returns 421/403 (Host/Origin) before any tool runs, so no tool_called ever fired. Auth-rejection rate is the single most important HTTPS-flow health signal. Adds AuthRejectionMiddleware — pure ASGI (never buffers streaming SSE; reads only the first response status line) wrapped outermost in build_app(). For 401/421/403 on AUTHENTICATED paths it emits opik_mcp_auth_rejected with a bucketed rejection_reason (missing_header/not_bearer/empty_token/token_rejected/ host_rejected/origin_rejected), auth_mode, path_bucket, oauth_configured, resource_metadata_url_present, and the cached env cohort. _UNAUTH_PATHS (health, OAuth-proxy, discovery) are skipped — a 401 proxied from the AS during the OAuth dance is not opik-mcp's rejection and would pollute the chart. Non-http scopes pass through so the composed lifespan still runs. auth_mode is derived from the request header inside the middleware, because it runs AFTER BearerAuthMiddleware reset the inbound-auth ContextVar — so a valid OAuth bearer rejected by the Host guard correctly reports auth_mode='oauth' (not the settings fallback). The bearer classification is factored into the shared auth_context.classify_bearer, used by both _build_event and the middleware so they cannot drift. rejection_reason is derived from status + header SHAPE only — the raw token never enters the event (canary test). _has_absolute_resource_metadata_url fixes the always-truthy _resource_metadata_url. build_app() now returns ASGIApp. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * test+refactor(analytics): privacy sweep for auth_rejected; break import cycle Extends the cross-event privacy sweep (test_analytics_privacy) to cover opik_mcp_auth_rejected with a canary OAuth bearer, asserting neither the token nor env values leak. Moves the pure installation_type(settings) derivation (cloud/local/self-hosted + host sets) from error_tracking into config (a leaf module). boot_props. installation_type previously delegated to error_tracking, which imports analytics.identity -> analytics.__init__ -> client -> (lazy) boot_props — a static import cycle mypy followed and reported as 'opik_mcp has no attribute server/error_tracking'. With the classifier in config, both error_tracking (Sentry tag) and boot_props (BI) import it cycle-free. The taxonomy test now targets config.installation_type at its source. Also makes the new test Settings(**base) helpers mypy-clean under the project's no_implicit_reexport config. Whole project green: mypy (107 files), ruff check + format, and 1061 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * feat(deploy): run hosted server via main() so all BI events fire (both modes) The Docker image ran 'uvicorn opik_mcp.server:build_app --factory', bypassing main() — so opik_mcp_startup_error (and the preflight bind check + OAuth-config guard) never fired in hosted mode, the one event the build_app() lifespan does not cover. Switch the entrypoint to 'python -m opik_mcp' so main() runs in hosted HTTP exactly as in stdio: it emits server_started/server_shutdown/ startup_error, runs the guards, then serves via uvicorn (single worker). The _OPIK_MCP_LIFECYCLE_OWNED_BY_MAIN sentinel (set before build_app()) makes the composed lifespan defer to main(), so boots are never double-counted — the new test asserts the sentinel is set at build_app() time and that lifecycle is emitted with lifecycle_source='main'. main()'s HTTP run now passes access_log=False (parity with the removed --no-access-log; keeps OAuth-flow query strings, which can carry tokens, out of stdout). Helm inherits the image entrypoint, so no chart change is needed. Net: every BI event now fires identically for stdio (local) and HTTP (hosted) installations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * refactor(analytics): apply code-review fixes From a full review of the branch: - auth_mode: the no-credential fallback in _build_event and AuthRejectionMiddleware was 2-way (api_key/none), missing 'oauth' for OAuth-only deploys. Extracted the settings-derived 3-way logic into auth_context.settings_auth_mode, now shared by client, middleware, and boot_props.auth_mode_at_boot so per-call and boot agree. - installation_type: computed ONCE in AnalyticsClient.__init__ (was an uncached urlparse + lazy import per event on the hot path); falls back to 'unknown' instead of silently dropping the key. - transport: lowercased in the common block so a mixed-case OPIK_MCP_TRANSPORT still emits a canonical BI value. - Extracted server._is_unauth_path(), shared by BearerAuthMiddleware and AuthRejectionMiddleware, so the skip-set can't drift between them. - Reload HTTP path now passes access_log=False too (was only on the non-reload path) — keeps OAuth-flow query strings out of dev logs. - _build_fallback_analytics_client re-reads COMET_URL_OVERRIDE/OPIK_URL so a self-hosted install's config-fail startup_error isn't mislabelled 'cloud'. - Lifespan flush captures the analytics client before run_in_executor (no in-thread singleton re-fetch); de-dup a double ws_header.strip(). Left intentionally as-is (documented): opik_client.resolve_opik_config still has its own opik_at_ detection (core auth path — out of scope to refactor here), and startup_error remains main()-only (the hosted entrypoint runs main(), so it is covered; the lifespan fallback path reports server_shutdown(transport_error)). Whole project green: ruff, ruff format, mypy (107 files), 1067 tests. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
feat(config): OPIK_WORKSPACE env var + optional workspace (default "d… …efault") [OPIK-6783] (#141) * feat(config): OPIK_WORKSPACE env var + optional workspace defaulting to "default" OPIK_WORKSPACE becomes the primary, OPIK_-prefixed env var for the workspace (matching the Opik SDK and opik-mcp's own OPIK_API_KEY). COMET_WORKSPACE stays as a deprecated backward-compat alias via pydantic AliasChoices; OPIK_WORKSPACE wins when both are set. No breaking change. The workspace is now optional: resolve_opik_config and require_ollie_config fall back to "default" (the Opik SDK's OPIK_WORKSPACE_DEFAULT_NAME) instead of raising MissingConfigError, so local/OSS users can run without setting one. The field stays str|None so the analytics has_workspace flag still reflects an explicit set. Instructions text, 401/403 error hints and the README now reference OPIK_WORKSPACE. OPIK-6783 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * docs(readme): note OPIK_WORKSPACE is optional and can be omitted in setup snippets Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
fix(analytics): retry transient POST failures so server_started lands (… …#140) The fire-and-forget worker did a single POST per event and dropped it on any failure. The first event of a session (server_started) pays the cold DNS+TLS handshake, so its POST routinely times out/errors and was lost forever — even though that failed attempt warms the pool and a retry would land. In prod ~87% of v0.1.7+ installs that emit tool_called have no server_started row at all. Fix: bounded retry-with-backoff in the worker (retry_backoff_s=(0,0.5,1.5)); the second attempt reuses the warmed connection. Widen the exit-flush deadline 2.0->3.5s so a cold attempt has room within the bounded drain. track_event stays non-blocking and never raises; delivery is at-least-once (BI dedups). Version -> 0.1.9. Scope: durable disk buffer (WAL) for SIGKILL / ultra-short sessions is a measured follow-up — re-run the diagnostic SQL post-release and add it only if we miss the >=99% target. Release alongside #136. Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
PreviousNext