Releases: dokimos-dev/dokimos

v0.22.0

09 Jun 13:45

What's Changed

Full Changelog: v0.21.0...v0.22.0

v0.21.0

06 Jun 20:57

What's Changed

Full Changelog: v0.20.0...v0.21.0

v0.20.0

04 Jun 17:32

What's Changed

  • Bump react-router from 7.12.0 to 7.15.0 in /dokimos-server/frontend in the npm_and_yarn group across 1 directory by @dependabot[bot] in #118
  • Add support for typed structured outputs by @fkapsahili in #117

Full Changelog: v0.19.0...v0.20.0

v0.19.0

03 Jun 05:59

What's Changed

  • Fix RecallEvaluator returning scores above 1.0 on duplicate retrieved items by @fkapsahili in #111
  • Make the docs and landing agent-friendly by @fkapsahili in #112
  • Fix mobile overflow and make docs search a Cmd+K modal by @fkapsahili in #114
  • Fix verified bugs and add API improvements across modules by @fkapsahili in #113
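
To illustrate the RecallEvaluator fix listed above: recall can exceed 1.0 when every retrieved item that matches a relevant one is counted, duplicates included. The following is a minimal sketch of a deduplicating computation, not the library's actual implementation; `RecallSketch` and its `recall` method are hypothetical names, and returning 0.0 for an empty relevant set is an assumed convention.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not the dokimos RecallEvaluator.
final class RecallSketch {

    // Recall = |distinct retrieved items that are relevant| / |relevant items|.
    static double recall(List<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumed convention for an empty ground truth
        }
        // Collapsing duplicates first keeps the numerator at or below |relevant|.
        Set<String> hits = new HashSet<>(retrieved);
        hits.retainAll(relevant);
        return (double) hits.size() / relevant.size();
    }

    public static void main(String[] args) {
        List<String> retrieved = Arrays.asList("doc-a", "doc-a", "doc-a", "doc-b");
        Set<String> relevant = new HashSet<>(Arrays.asList("doc-a", "doc-b", "doc-c"));
        // Counting every match would give 4/3; deduplicating gives 2/3.
        System.out.println(recall(retrieved, relevant));
    }
}
```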

Full Changelog: v0.18.0...v0.19.0

v0.18.0

02 Jun 08:40

What's Changed

Full Changelog: v0.17.0...v0.18.0

v0.17.0

31 May 08:20

What's Changed

Full Changelog: v0.16.0...v0.17.0

v0.16.0

30 May 20:56

Overview

This release turns dokimos-server from a passive results store into a platform that closes the eval-driven-development loop. You can now author and version datasets on the server, gate regressions in CI, judge runs and live production traffic with an LLM on the server, get alerted when quality drops, and curate real failures back into your datasets, all from one place. Most of the work is in the server and its web UI, with supporting pieces in the SDK and a new GitHub Action.

Highlights

Server-authoritative datasets. Hold datasets on the server, versioned and shared, and reference an exact version from code with a dataset://name@version URI through the registry or @DatasetSource. The new SDK resolver in dokimos-server-client keeps an offline cache, so a pinned version still resolves when the server is briefly unreachable. (#81, #87)
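
As a rough illustration of the dataset://name@version form, the sketch below splits such a URI into its name and version. It is not the SDK resolver: `DatasetRef` is a hypothetical class, and the grammar it accepts (a non-empty name and a non-empty version, neither containing `@` or `/`) is an assumption, since the release notes do not define the exact syntax.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the dokimos API.
final class DatasetRef {

    // Assumed grammar: "dataset://" + name + "@" + version.
    private static final Pattern PINNED_URI =
            Pattern.compile("^dataset://([^@/]+)@([^@/]+)$");

    final String name;
    final String version;

    private DatasetRef(String name, String version) {
        this.name = name;
        this.version = version;
    }

    // Parses a pinned dataset URI or rejects it with an IllegalArgumentException.
    static DatasetRef parse(String uri) {
        Matcher m = PINNED_URI.matcher(uri);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a pinned dataset URI: " + uri);
        }
        return new DatasetRef(m.group(1), m.group(2));
    }

    public static void main(String[] args) {
        // "support-tickets" and "3" are made-up example values.
        DatasetRef ref = DatasetRef.parse("dataset://support-tickets@3");
        System.out.println(ref.name + " pinned at version " + ref.version);
    }
}
```

Requiring an explicit version in the parser reflects the point of pinning: a run that names an exact version keeps evaluating the same items even after the dataset gains newer versions.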

CI regression gate and run diff. The server decides whether a run regressed against its baseline and can fail your build, with a significance gate (McNemar for pass/fail flips, a paired permutation test with a bootstrap interval otherwise) so a noisy judge does not flake your pipeline. A new item-by-item diff view shows exactly what moved. Ships with a reusable GitHub Action. (#78, #82, #83)
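
To make the McNemar part of the significance gate concrete, here is a minimal sketch of an exact two-sided McNemar test on pass/fail flips. Which variant the server uses (exact binomial, chi-square, with or without continuity correction) is not stated in these notes, so the exact binomial form is an assumption, and `McNemar` and `exactPValue` are hypothetical names.

```java
// Hypothetical illustration of an exact McNemar test; not the server's code.
final class McNemar {

    // b = items that passed on the baseline and fail on the candidate,
    // c = items that failed on the baseline and pass on the candidate.
    // Items with the same outcome in both runs do not enter the test.
    static double exactPValue(int b, int c) {
        int n = b + c;
        if (n == 0) {
            return 1.0; // no flips, so no evidence of a change
        }
        int k = Math.min(b, c);
        // Under the null hypothesis each flip goes either way with
        // probability 0.5, so the smaller count follows Binomial(n, 0.5).
        double term = Math.pow(0.5, n); // P(X = 0)
        double tail = term;
        for (int i = 1; i <= k; i++) {
            term = term * (n - i + 1) / i; // P(X = i) from P(X = i - 1)
            tail += term;
        }
        return Math.min(1.0, 2.0 * tail); // two-sided, capped at 1
    }

    public static void main(String[] args) {
        // 9 items newly failing against 1 newly passing.
        System.out.println(exactPValue(9, 1));
        // 5 flips in each direction looks like noise.
        System.out.println(exactPValue(5, 5));
    }
}
```

With 9 flips one way and 1 the other, the p-value is about 0.021, so a gate set at 0.05 would treat the change as real; an even 5 and 5 split gives 1.0 and would pass as judge noise. For very large flip counts the `Math.pow(0.5, n)` term underflows, so a production version would work in log space.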

Server-side LLM judge. Score runs and traces on the server using a stored LLM connection. The judge speaks the vendor-neutral Open Responses API by default, with Chat Completions as a fallback, selectable per connection. A judge-vs-human alignment metric tells you how far to trust it. (#88, #90, #91, #92)

Production traces and online evaluation. Ingest OTLP traces from your running app and evaluate them online as they arrive, using the same judge machinery as offline experiments. Trace eval rules match spans by name or attribute and score them automatically, with a configurable retention window. (#93)

Regression alerting. Get a webhook (with an HMAC signature) when a run regresses against its baseline, using the same comparison the CI gate acts on, so a quality drop reaches your chat or on-call tooling without anyone watching a dashboard. (#94)
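
A receiver would typically check the HMAC signature before trusting the payload. The sketch below shows one way to do that in plain Java. The release notes do not state the hash function, the digest encoding, or the header that carries the signature, so HMAC-SHA256 over the raw request body with a lowercase hex digest is an assumption here, and `WebhookSignature`, `sign`, and `verify` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Locale;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical receiver-side check; algorithm and encoding are assumptions.
final class WebhookSignature {

    // Computes the hex-encoded HMAC-SHA256 of the raw body.
    static String sign(String secret, byte[] body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret.getBytes(StandardCharsets.UTF_8), "HmacSHA256"));
            StringBuilder hex = new StringBuilder();
            for (byte b : mac.doFinal(body)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }

    // Compares the received signature with the expected one in constant time.
    static boolean verify(String secret, byte[] body, String receivedHex) {
        byte[] expected = sign(secret, body).getBytes(StandardCharsets.UTF_8);
        byte[] received = receivedHex.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, received);
    }

    public static void main(String[] args) {
        // Made-up secret and payload for demonstration only.
        byte[] body = "{\"status\":\"REGRESSED\"}".getBytes(StandardCharsets.UTF_8);
        String signature = sign("shared-secret", body);
        System.out.println(verify("shared-secret", body, signature));
    }
}
```

The signature should be computed over the bytes exactly as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the digest.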

Review and curation loop. Review the items evaluators got wrong, annotate them with a verdict and a corrected expected output, and promote them into a new dataset version. A production miss becomes a regression test. (#85)

Role-scoped API keys (RBAC). Issue scoped API keys with VIEWER, EDITOR, or ADMIN roles alongside the existing single-key mode. Reads stay open, writes require EDITOR, and API-key management requires ADMIN. No-key and single-key deployments behave exactly as before. (#95, #96, #98)
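
The access rules in scoped-key mode can be summarized in a few lines of code. This is a sketch of the policy as described above, not the server's implementation: `AccessPolicy`, `Role`, `Action`, and `isAllowed` are hypothetical names, and treating the roles as a hierarchy (ADMIN can do everything EDITOR can) is an assumption. It does not model the no-key and single-key modes.

```java
// Hypothetical model of the scoped-key rules; not the server's code.
final class AccessPolicy {

    // Declared in ascending order of privilege (assumed hierarchy).
    enum Role { VIEWER, EDITOR, ADMIN }

    enum Action { READ, WRITE, MANAGE_API_KEYS }

    // role is null when the request carries no scoped key.
    static boolean isAllowed(Role role, Action action) {
        switch (action) {
            case READ:
                return true; // reads stay open
            case WRITE:
                return role != null && role.compareTo(Role.EDITOR) >= 0;
            case MANAGE_API_KEYS:
                return role == Role.ADMIN;
            default:
                throw new IllegalArgumentException("Unknown action: " + action);
        }
    }

    public static void main(String[] args) {
        for (Role role : Role.values()) {
            for (Action action : Action.values()) {
                System.out.println(role + " " + action + " -> " + isAllowed(role, action));
            }
        }
    }
}
```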

Per-item token, cost, and latency metrics. Item results now carry token usage, cost, and latency so you can track spend and speed next to quality. (#89)

Other improvements

  • Lenient judge-response parsing tolerates trailing commas, single quotes, and other minor JSON slips from an LLM, so a slightly malformed judge reply no longer fails a run. (#97)
  • The server now serves the SPA on deep links, so refreshing on a route like /traces or /datasets/x loads the app instead of a 404. (#97)
  • Idempotent result ingestion and JSONB item storage on the server. (#78)
  • Web UI polish: semantic status colors, diff-view layout, button and dialog fixes. (#83, #86)
  • Comprehensive server documentation for every feature above.

Upgrade notes

  • Database migrations run automatically on server startup (Flyway V5 through V13). No manual steps.
  • Backward compatible. Existing single-key and no-key deployments are unchanged. Reads remain open; only API-key management now requires an ADMIN credential, which a single-key or no-key deployment already satisfies.
  • To use the server LLM judge with a stored API key, set DOKIMOS_ENCRYPTION_KEY (it encrypts connection credentials at rest). The server boots fine without it; you only need it to register or use a connection that holds an inline key.
  • Existing LLM connections are backfilled to the Chat Completions protocol; new connections default to Open Responses.
  • New optional settings: DOKIMOS_ENCRYPTION_KEY, DOKIMOS_TRACE_RETENTION_DAYS (default 30), and the judge and trace settings documented in Configuration (https://dokimos.dev/docs/server/configuration).

Full Changelog: v0.15.0...v0.16.0

v0.15.0

28 May 12:42
d367710

What's Changed

  • Bump the npm_and_yarn group across 2 directories with 1 update by @dependabot[bot] in #60
  • Bump node-forge from 1.3.3 to 1.4.0 in /docs in the npm_and_yarn group across 1 directory by @dependabot[bot] in #61
  • Bump org.springframework.ai:spring-ai-vector-store from 1.1.3 to 1.1.4 in /dokimos-examples in the maven group across 1 directory by @dependabot[bot] in #63
  • Bump brace-expansion from 1.1.12 to 1.1.13 in /docs in the npm_and_yarn group across 1 directory by @dependabot[bot] in #62
  • Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #64
  • Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #66
  • Bump the npm_and_yarn group across 2 directories with 1 update by @dependabot[bot] in #67
  • Bump the maven group across 2 directories with 2 updates by @dependabot[bot] in #71
  • Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #70
  • Bump the npm_and_yarn group across 2 directories with 4 updates by @dependabot[bot] in #72
  • Add MCP server for agents by @fkapsahili in #57
  • Allow publishing a single Docker image on manual dispatch by @fkapsahili in #73
  • Report the MCP server version from the pom by @fkapsahili in #74
  • Trigger docs deploy on MCP server changes by @fkapsahili in #75
  • Point Docusaurus url at live dokimos.dev custom domain by @fkapsahili in #76
  • Add Velm chat widget to docs by @fkapsahili in #77
  • Run comparison engine, idempotent ingestion, JSONB items by @fkapsahili in #78

Full Changelog: v0.14.2...v0.15.0

v0.14.2

23 Mar 14:26

What's Changed

  • Add code formatting with Spotless (Palantir + ktlint) by @fkapsahili in #53
  • Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot[bot] in #50
  • Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot[bot] in #54
  • Bump flatted from 3.3.3 to 3.4.2 in /dokimos-server/frontend in the npm_and_yarn group across 1 directory by @dependabot[bot] in #56
  • Bump org.springframework.ai:spring-ai-vector-store from 1.1.2 to 1.1.3 in /dokimos-examples in the maven group across 1 directory by @dependabot[bot] in #55
  • Fix ToolArgumentHallucinationEvaluator to ground arguments in preceding tool results by @fkapsahili in #58

Full Changelog: v0.14.1...v0.14.2

v0.14.1

28 Feb 19:04

What's Changed

  • Rework agent evaluator IT to use real OpenAI tool calling by @fkapsahili in #51

Full Changelog: v0.14.0...v0.14.1