LLM evaluation for the JVM

The LLM evaluation framework for Java and Kotlin.

Evaluate responses and agent tool calls, track quality over time, and catch regressions before they ship. Runs in the JUnit suite and CI you already have: one dependency, no new infrastructure. Framework agnostic, with one-line integrations for Spring AI, Spring AI Alibaba, LangChain4j, Koog, and Embabel.

  • Runs in JUnit and CI
  • Framework agnostic
  • MIT · Maven Central
RagEvalTest.java
class RagEvalTest {

  @Test
  void answersAreCorrectAndFaithful() {
    var result = Experiment.builder()
        .dataset(Dataset.fromJson("qa-pairs.json"))
        .task(example -> ragPipeline.answer(example.input()))
        .evaluators(
            new CorrectnessEvaluator(judge),
            new FaithfulnessEvaluator(judge))
        .build()
        .run();

    // fail the build unless the pass rate exceeds 90%
    assertThat(result.passRate()).isGreaterThan(0.9);
  }
}

Get started your way

One dependency for humans. One line for agents. Pick your on-ramp.

Add the dependency

One line in your test scope. That is the whole install.

pom.xml
<dependency>
    <groupId>dev.dokimos</groupId>
    <artifactId>dokimos-junit</artifactId>
    <version>0.23.0</version>
    <scope>test</scope>
</dependency>

Pulls in dokimos-core. Gradle and the framework integration modules (Spring AI, Spring AI Alibaba, LangChain4j, Koog, Embabel) are in the install guide.

Write your first eval

Point the JUnit integration at a dataset and run it like any other test.

FirstEvalTest.java
@DatasetSource("qa-pairs.json")
@EvalTest
void evaluate(EvalTestCase testCase) {
    String answer = ragPipeline.answer(testCase.input());

    assertThat(answer)
        .satisfies(new CorrectnessEvaluator(judge));
}

Runs in mvn test and your existing CI, with no new services to stand up.

Dataset-driven evaluation

Load test cases from JSON or CSV, or build them in code. Run the same dataset across experiments and JUnit tests, and track quality as it changes.
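
To make the idea concrete, here is a minimal, self-contained sketch of dataset-driven evaluation in plain Java (16+). It is not the dokimos API: `DatasetSketch`, `Example`, `fromCsvLines`, and `passRate` are hypothetical names invented for illustration, and the CSV parsing assumes simple `input,expected` lines with no quoted fields. It only shows the shape of the workflow: load examples, run a task over each one, and reduce the evaluator verdicts to a pass rate.

```java
import java.util.List;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Hypothetical sketch, not the dokimos API: all names here are illustrative.
final class DatasetSketch {

    /** One test case: an input and the answer we expect. */
    record Example(String input, String expected) {}

    /** Parses "input,expected" lines; assumes one comma-separated pair per line, no quoting. */
    static List<Example> fromCsvLines(List<String> lines) {
        return lines.stream()
                .filter(line -> !line.isBlank())
                .map(line -> line.split(",", 2))
                .map(cols -> new Example(cols[0].trim(), cols[1].trim()))
                .toList();
    }

    /** Fraction of examples whose task output the evaluator accepts. */
    static double passRate(List<Example> dataset,
                           Function<String, String> task,
                           BiPredicate<String, String> evaluator) {
        if (dataset.isEmpty()) {
            return 0.0;
        }
        long passed = dataset.stream()
                .filter(ex -> evaluator.test(task.apply(ex.input()), ex.expected()))
                .count();
        return (double) passed / dataset.size();
    }

    public static void main(String[] args) {
        List<Example> dataset = fromCsvLines(List.of(
                "capital of France,Paris",
                "capital of Japan,Tokyo"));
        // Stand-in for a real pipeline: always answers "Paris".
        Function<String, String> task = input -> "Paris";
        // Exact match (ignoring case) plays the role of an evaluator.
        double rate = passRate(dataset, task, String::equalsIgnoreCase);
        System.out.println(rate); // prints 0.5
    }
}
```

Because the dataset is just a list of examples, the same list can feed several runs with different tasks or evaluators, which is what makes comparing quality across changes possible.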

Built-in and agent evaluators

Hallucination, faithfulness, contextual relevance, and LLM-as-judge, plus tool-call validity, trajectory, and task completion for agents.
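
As a rough illustration of what agent checks such as tool-call validity and trajectory can mean, the sketch below implements two simple deterministic versions in plain Java (16+). It is an assumption-laden example, not how the built-in evaluators work: `ToolCallSketch`, `ToolCall`, `isValid`, and `trajectoryMatches` are hypothetical names, "valid" is taken to mean "known tool with all required arguments present", and "trajectory" is taken to mean "expected tools called in order".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not the dokimos API: all names here are illustrative.
final class ToolCallSketch {

    /** A tool call made by an agent: tool name plus named arguments. */
    record ToolCall(String name, Map<String, Object> arguments) {}

    /** A call is valid if the tool is known and every required argument is present. */
    static boolean isValid(ToolCall call, Map<String, Set<String>> requiredArgsByTool) {
        Set<String> required = requiredArgsByTool.get(call.name());
        return required != null && call.arguments().keySet().containsAll(required);
    }

    /** Trajectory check: the agent called exactly the expected tools, in order. */
    static boolean trajectoryMatches(List<ToolCall> calls, List<String> expectedNames) {
        List<String> actualNames = calls.stream().map(ToolCall::name).toList();
        return actualNames.equals(expectedNames);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> schema = Map.of(
                "search", Set.of("query"),
                "send_email", Set.of("to", "body"));
        List<ToolCall> calls = List.of(
                new ToolCall("search", Map.of("query", "refund policy")),
                new ToolCall("send_email", Map.of("to", "a@example.com")));

        System.out.println(isValid(calls.get(0), schema)); // prints true
        System.out.println(isValid(calls.get(1), schema)); // prints false ("body" is missing)
        System.out.println(trajectoryMatches(calls, List.of("search", "send_email"))); // prints true
    }
}
```

Checks like these need no judge model, so they are cheap to run on every build; judgments such as faithfulness or task completion typically need an LLM judge on top.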

Framework agnostic

The core depends on no AI framework, so it works with any LLM client. Optional one-line integrations cover Spring AI, Spring AI Alibaba, LangChain4j, Koog, Embabel, and JUnit.