Releases · dokimos-dev/dokimos

v0.22.0

@fkapsahili

What's Changed

Add a server-free regression gate for evals by @fkapsahili in #133

Full Changelog: v0.21.0...v0.22.0

v0.21.0

What's Changed

Reworks docs for more clarity by @fkapsahili in #119
Fix changelog Latest badge by @fkapsahili in #120
Add Embabel and Spring AI Alibaba agent adapters by @fkapsahili in #122
Add cost, token, and latency metrics across all five adapters by @fkapsahili in #123

Full Changelog: v0.20.0...v0.21.0

v0.20.0

What's Changed

Bump react-router from 7.12.0 to 7.15.0 in /dokimos-server/frontend in the npm_and_yarn group across 1 directory by @dependabot[bot] in #118
Add support for typed structured outputs by @fkapsahili in #117

Full Changelog: v0.19.0...v0.20.0

v0.19.0

What's Changed

Fix RecallEvaluator returning scores above 1.0 on duplicate retrieved items by @fkapsahili in #111
Make the docs and landing agent-friendly by @fkapsahili in #112
Fix mobile overflow and make docs search a Cmd+K modal by @fkapsahili in #114
Fix verified bugs and add API improvements across modules by @fkapsahili in #113

Full Changelog: v0.18.0...v0.19.0

v0.18.0

What's Changed

Design system for the web UI and docs by @fkapsahili in #106
Framework trace extractors for agent evaluation by @fkapsahili in #107

Full Changelog: v0.17.0...v0.18.0

v0.17.0

What's Changed

Accept protobuf-encoded OTLP traces by @fkapsahili in #99
Add changelog and frame the production eval loop by @fkapsahili in #100
Isolate data by tenant for scoped API keys by @fkapsahili in #102
Patch docs dependency advisories by @fkapsahili in #104
Deterministic trajectory and tool-result evaluators by @fkapsahili in #103

Full Changelog: v0.16.0...v0.17.0

v0.16.0

Overview

This release turns dokimos-server from a passive results store into a platform that closes the eval-driven-development loop. You can now author and version datasets on the server, gate regressions in CI, judge runs and live production traffic with an LLM on the server, get alerted when quality drops, and curate real failures back into your datasets, all from one place. Most of the work is in the server and its web UI, with supporting pieces in the SDK and a new GitHub Action.

Highlights

Server-authoritative datasets. Hold datasets on the server, versioned and shared, and reference an exact version from code with a dataset://name@version URI through the registry or @DatasetSource. The new SDK resolver in dokimos-server-client keeps an offline cache, so a pinned version still resolves when the server is briefly unreachable. (#81, #87)
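
To make the dataset://name@version shape concrete, here is a minimal Python sketch that splits such a URI into its name and pinned version. It is not the SDK resolver: `DatasetRef` and `parse_dataset_uri` are hypothetical names, and treating the version as an opaque string is an assumption, since these notes do not specify the version format.

```python
from dataclasses import dataclass

PREFIX = "dataset://"


@dataclass(frozen=True)
class DatasetRef:
    """Hypothetical holder for a pinned dataset reference."""
    name: str
    version: str  # kept as an opaque string; the real format is not stated in the notes


def parse_dataset_uri(uri: str) -> DatasetRef:
    """Split 'dataset://name@version' into its two parts."""
    if not uri.startswith(PREFIX):
        raise ValueError(f"not a dataset URI: {uri!r}")
    # rpartition splits on the last '@', so a name may itself contain '@'.
    name, sep, version = uri[len(PREFIX):].rpartition("@")
    if not sep or not name or not version:
        raise ValueError(f"expected dataset://name@version, got {uri!r}")
    return DatasetRef(name=name, version=version)
```

Rejecting a URI without a version is a design choice of this sketch: it forces callers to pin, which matches the "exact version" wording above.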

CI regression gate and run diff. The server decides whether a run regressed against its baseline and can fail your build, with a significance gate (McNemar for pass/fail flips, a paired permutation test with a bootstrap interval otherwise) so a noisy judge does not flake your pipeline. A new item-by-item diff view shows exactly what moved. Ships with a reusable GitHub Action. (#78, #82, #83)
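
The two tests named above can be sketched in a few lines of Python. This illustrates the statistics only, not the server's implementation: the function names, the exact (binomial) form of McNemar, the sign-flip form of the permutation test, and the default of 10,000 rounds are all assumptions, and the bootstrap interval is left out.

```python
import random
from math import comb


def mcnemar_exact_p(baseline, candidate):
    """Two-sided exact McNemar p-value for paired pass/fail results."""
    if len(baseline) != len(candidate):
        raise ValueError("runs must cover the same items")
    # Only items whose outcome flipped carry information.
    regressed = sum(1 for b, c in zip(baseline, candidate) if b and not c)
    fixed = sum(1 for b, c in zip(baseline, candidate) if not b and c)
    flips = regressed + fixed
    if flips == 0:
        return 1.0
    # Under the null hypothesis each flip is equally likely in either direction.
    k = min(regressed, fixed)
    tail = sum(comb(flips, i) for i in range(k + 1)) / 2 ** flips
    return min(1.0, 2 * tail)


def paired_permutation_p(diffs, rounds=10_000, seed=0):
    """Two-sided sign-flip permutation p-value for per-item score differences."""
    if not diffs:
        return 1.0
    rng = random.Random(seed)
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(rounds):
        # Randomly swap baseline and candidate for each item.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            hits += 1
    # The +1 terms keep the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)
```

A gate built on these would fail the build only when the candidate is worse and the p-value falls below a chosen threshold, which is what keeps a single noisy flip from breaking CI.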

Server-side LLM judge. Score runs and traces on the server using a stored LLM connection. The judge speaks the vendor-neutral Open Responses API by default, with Chat Completions as a fallback, selectable per connection. A judge-vs-human alignment metric tells you how far to trust it. (#88, #90, #91, #92)

Production traces and online evaluation. Ingest OTLP traces from your running app and evaluate them online as they arrive, using the same judge machinery as offline experiments. Trace eval rules match spans by name or attribute and score them automatically, with a configurable retention window. (#93)
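
As a rough illustration of matching spans "by name or attribute", the Python sketch below models a rule with an optional span name and an optional attribute pair. The `Span` and `TraceEvalRule` types are hypothetical and much simpler than real OTLP spans; the server's actual rule schema is not described in these notes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional, Tuple


@dataclass
class Span:
    """Hypothetical, simplified stand-in for an ingested span."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass
class TraceEvalRule:
    """Hypothetical rule: match on a span name, an attribute pair, or either."""
    span_name: Optional[str] = None
    attribute: Optional[Tuple[str, Any]] = None  # (key, expected value)

    def matches(self, span: Span) -> bool:
        if self.span_name is not None and span.name == self.span_name:
            return True
        if self.attribute is not None:
            key, expected = self.attribute
            return span.attributes.get(key) == expected
        return False


def select_spans(rule: TraceEvalRule, spans):
    """Return the spans a rule would hand to the judge."""
    return [s for s in spans if rule.matches(s)]
```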

Regression alerting. Get a webhook (with an HMAC signature) when a run regresses against its baseline, using the same comparison the CI gate acts on, so a quality drop reaches your chat or on-call tooling without anyone watching a dashboard. (#94)
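
On the receiving side, an HMAC-signed webhook is usually checked by recomputing the signature over the raw request body and comparing in constant time. The Python sketch below assumes HMAC-SHA256 with a hex-encoded digest; the hash algorithm, the encoding, and the header that carries the signature are assumptions here, so check the server documentation for the actual values.

```python
import hashlib
import hmac


def sign_body(secret: bytes, body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw body (algorithm and encoding are assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True only if the signature matches the body byte for byte."""
    expected = sign_body(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), received_signature.encode())
```

Verify against the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the signature.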

Review and curation loop. Review the items evaluators got wrong, annotate them with a verdict and a corrected expected output, and promote them into a new dataset version. A production miss becomes a regression test. (#85)

Role-scoped API keys (RBAC). Issue scoped API keys with VIEWER, EDITOR, or ADMIN roles alongside the existing single-key mode. Reads stay open, writes require EDITOR, and API-key management requires ADMIN. No-key and single-key deployments behave exactly as before. (#95, #96, #98)
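
The access rules above fit in a small lookup table. This Python sketch covers scoped-key mode only and assumes the roles are hierarchical (ADMIN implies EDITOR implies VIEWER), which the notes suggest but do not state; the action names are hypothetical.

```python
from enum import IntEnum
from typing import Optional


class Role(IntEnum):
    VIEWER = 1
    EDITOR = 2
    ADMIN = 3


# Minimum role per action; None means the action is open to everyone.
REQUIRED_ROLE = {
    "read": None,
    "write": Role.EDITOR,
    "manage_api_keys": Role.ADMIN,
}


def is_allowed(action: str, role: Optional[Role]) -> bool:
    """Check an action against the caller's role (None = no key presented)."""
    required = REQUIRED_ROLE[action]
    if required is None:
        return True
    return role is not None and role >= required
```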

Per-item token, cost, and latency metrics. Item results now carry token usage, cost, and latency so you can track spend and speed next to quality. (#89)

Other improvements

Lenient judge-response parsing tolerates trailing commas, single quotes, and other minor JSON slips from an LLM, so a slightly malformed judge reply no longer fails a run. (#97)
The server now serves the SPA on deep links, so refreshing on a route like /traces or /datasets/x loads the app instead of a 404. (#97)
Idempotent result ingestion and JSONB item storage on the server. (#78)
Web UI polish: semantic status colors, diff-view layout, button and dialog fixes. (#83, #86)
Comprehensive server documentation for every feature above.
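
For a sense of what lenient judge-response parsing involves, here is one possible approach in Python: try strict JSON first, then strip trailing commas, then fall back to Python literal syntax for single-quoted replies. This is an illustrative sketch, not the server's parser, and it has known gaps: the regex can also touch a comma inside a string value, and the single-quote fallback does not understand JSON's true, false, or null.

```python
import ast
import json
import re

# A comma followed only by whitespace and a closing brace or bracket.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")


def parse_lenient(text: str):
    """Parse a judge reply, tolerating trailing commas and single quotes."""
    try:
        return json.loads(text)  # the strict path handles well-formed replies
    except json.JSONDecodeError:
        pass
    repaired = _TRAILING_COMMA.sub(r"\1", text)
    try:
        return json.loads(repaired)
    except json.JSONDecodeError:
        pass
    # Single-quoted objects are valid Python literals; literal_eval never
    # executes code, and raises ValueError or SyntaxError on anything else.
    return ast.literal_eval(repaired)
```

Trying the strict parser first means a valid reply is never altered by the repair steps.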

Upgrade notes

Database migrations run automatically on server startup (Flyway V5 through V13). No manual steps.
Backward compatible. Existing single-key and no-key deployments are unchanged. Reads remain open; only API-key management now requires an ADMIN credential, which a single-key or no-key deployment already satisfies.
To use the server LLM judge with a stored API key, set DOKIMOS_ENCRYPTION_KEY (it encrypts connection credentials at rest). The server boots fine without it; you only need it to register or use a connection that holds an inline key.
Existing LLM connections are backfilled to the Chat Completions protocol; new connections default to Open Responses.
New optional settings: DOKIMOS_ENCRYPTION_KEY, DOKIMOS_TRACE_RETENTION_DAYS (default 30), and the judge and trace settings documented in Configuration (https://dokimos.dev/docs/server/configuration).

Full Changelog: v0.15.0...v0.16.0

v0.15.0

What's Changed

Bump the npm_and_yarn group across 2 directories with 1 update by @dependabot[bot] in #60
Bump node-forge from 1.3.3 to 1.4.0 in /docs in the npm_and_yarn group across 1 directory by @dependabot[bot] in #61
Bump org.springframework.ai:spring-ai-vector-store from 1.1.3 to 1.1.4 in /dokimos-examples in the maven group across 1 directory by @dependabot[bot] in #63
Bump brace-expansion from 1.1.12 to 1.1.13 in /docs in the npm_and_yarn group across 1 directory by @dependabot[bot] in #62
Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #64
Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #66
Bump the npm_and_yarn group across 2 directories with 1 update by @dependabot[bot] in #67
Bump the maven group across 2 directories with 2 updates by @dependabot[bot] in #71
Bump the npm_and_yarn group across 2 directories with 2 updates by @dependabot[bot] in #70
Bump the npm_and_yarn group across 2 directories with 4 updates by @dependabot[bot] in #72
Add MCP server for agents by @fkapsahili in #57
Allow publishing a single Docker image on manual dispatch by @fkapsahili in #73
Report the MCP server version from the pom by @fkapsahili in #74
Trigger docs deploy on MCP server changes by @fkapsahili in #75
Point Docusaurus url at live dokimos.dev custom domain by @fkapsahili in #76
Add Velm chat widget to docs by @fkapsahili in #77
Run comparison engine, idempotent ingestion, JSONB items by @fkapsahili in #78

Full Changelog: v0.14.2...v0.15.0

v0.14.2

What's Changed

Add code formatting with Spotless (Palantir + ktlint) by @fkapsahili in #53
Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot[bot] in #50
Bump the npm_and_yarn group across 1 directory with 2 updates by @dependabot[bot] in #54
Bump flatted from 3.3.3 to 3.4.2 in /dokimos-server/frontend in the npm_and_yarn group across 1 directory by @dependabot[bot] in #56
Bump org.springframework.ai:spring-ai-vector-store from 1.1.2 to 1.1.3 in /dokimos-examples in the maven group across 1 directory by @dependabot[bot] in #55
Fix ToolArgumentHallucinationEvaluator to ground arguments in preceding tool results by @fkapsahili in #58

Full Changelog: v0.14.1...v0.14.2

v0.14.1

What's Changed

Rework agent evaluator IT to use real OpenAI tool calling by @fkapsahili in #51

Full Changelog: v0.14.0...v0.14.1
