RCA Evaluation Framework

Fixture-based evaluation system for testing Tarka's RCA quality.

Overview

This framework enables repeatable, CI-friendly testing of the agent's root cause analysis (RCA) capabilities. Instead of running against live clusters, it captures complete Investigation objects from real incidents and replays them deterministically.

Key Principles

Investigation-Centric: Capture the fully populated Investigation object (SSOT) rather than mocking individual provider calls
CI-Friendly: No live cluster dependencies, fast execution, deterministic results
Quality-Focused: Structured scoring against expected outcomes ("How close was the agent to finding the actual failure?")
Extensible: Easy to add new scenarios without code changes

Quick Start

1. Capture a Fixture from Live Cluster

# Interactive capture (recommended for first-time use)
poetry run python -m eval.tools.capture \
  --interactive \
  --alert-index 0 \
  --output eval/fixtures/my_scenario

# Or non-interactive
poetry run python -m eval.tools.capture \
  --fingerprint abc123 \
  --output eval/fixtures/my_scenario \
  --scenario-name "Job ImagePullBackOff" \
  --failure-type image_pull

This creates:

investigation.json - Captured Investigation object with all evidence
scenario.yaml - Metadata + expected outcomes (template, needs editing)
README.md - Human-readable documentation (template, needs editing)

2. Edit Expected Outcomes

Edit eval/fixtures/my_scenario/scenario.yaml to define what the agent is expected to find and how each component is weighted:

expected_outcomes:
  root_cause:
    patterns:
      - "ImagePullBackOff"
      - "image.*not found"
    match_type: "regex"  # exact, substring, regex, semantic

  proposed_fix:
    all_of:  # Must contain ALL of these
      - patterns: ["kubectl describe pod"]
        match_type: "substring"
    any_of:  # Must contain AT LEAST ONE
      - patterns: ["imagePullSecret", "ECR.*auth"]
        match_type: "regex"

  hypotheses:
    any_of: ["image not found", "authentication"]

  next_steps:
    command_types: ["kubectl", "aws"]
    must_include: ["describe pod"]

scoring:
  root_cause_weight: 0.4        # 40% of total score
  fix_accuracy_weight: 0.3      # 30%
  hypothesis_quality_weight: 0.2  # 20%
  next_steps_weight: 0.1        # 10%
  pass_threshold: 70            # Minimum score to pass
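The four weights are fractions of the total score, so they presumably need to sum to 1.0. Whether the framework enforces this is not stated here; the following is a minimal sketch of such a check, assuming the scoring block has already been parsed from YAML into a dict. `validate_scoring` is a hypothetical helper, not part of the framework.

```python
import math

WEIGHT_KEYS = (
    "root_cause_weight",
    "fix_accuracy_weight",
    "hypothesis_quality_weight",
    "next_steps_weight",
)


def validate_scoring(scoring: dict) -> None:
    """Raise ValueError if the scoring block looks inconsistent (hypothetical helper)."""
    missing = [key for key in WEIGHT_KEYS if key not in scoring]
    if missing:
        raise ValueError(f"missing weights: {missing}")
    total = sum(scoring[key] for key in WEIGHT_KEYS)
    # Floating-point sums such as 0.4 + 0.3 + 0.2 + 0.1 are not exactly 1.0.
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    if not 0 <= scoring.get("pass_threshold", 70) <= 100:
        raise ValueError("pass_threshold must be between 0 and 100")


# The scoring block from the example above, as it would look after YAML parsing.
scoring = {
    "root_cause_weight": 0.4,
    "fix_accuracy_weight": 0.3,
    "hypothesis_quality_weight": 0.2,
    "next_steps_weight": 0.1,
    "pass_threshold": 70,
}
validate_scoring(scoring)  # passes silently
```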

3. Run Evaluation

# Run all scenarios
poetry run pytest eval/runner.py -v

# Run specific scenario
poetry run pytest eval/runner.py::test_my_scenario -v

# Run with keyword filter
poetry run pytest eval/runner.py -k "image" -v

# Generate HTML report
poetry run pytest eval/runner.py --html=eval_report.html --self-contained-html

# Enable LLM enrichment (optional, slower)
poetry run pytest eval/runner.py --enable-llm -v

Architecture

Why Investigation-Centric?

Instead of mocking dozens of individual provider calls (complex request-response matching), we capture the fully populated Investigation object after evidence collection:

Real Cluster (once):
  Alert → run_investigation() → Investigation (with all evidence) → Save as fixture

CI/Local (many times):
  Load Investigation from fixture → Run analysis → Score RCA quality

Benefits:

Investigation is already the SSOT (Single Source of Truth)
Simpler capture: One JSON file vs. tracking dozens of provider calls
No request-response matching complexity
Better for LLM evaluation: LLM gets full Investigation context
Easier to version and diff: Compare fixtures across agent versions
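Because the fixture is a single JSON file, the "load" half of the replay flow can be very small. The sketch below only reads investigation.json into a plain dict; the real replay code presumably rebuilds a typed Investigation object, which is not shown here. `load_investigation` is a hypothetical name.

```python
import json
from pathlib import Path


def load_investigation(fixture_dir: str) -> dict:
    """Read investigation.json from a fixture directory as a plain dict.

    Hypothetical helper: it stops at the raw JSON and does not construct
    the agent's Investigation class.
    """
    path = Path(fixture_dir) / "investigation.json"
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

Usage would look like `load_investigation("eval/fixtures/my_scenario")`, after which the dict can be handed to whatever runs the analysis step.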

What Gets Tested?

The framework tests analysis and reasoning, not evidence collection:

✅ Tested (replayed from fixture):

Diagnostic module analysis
Base triage decision building
Hypothesis generation and scoring
Feature computation
Family enrichment
LLM RCA (optional)

❌ Not tested (already captured in fixture):

Prometheus queries
Kubernetes API calls
Log fetching
Evidence collection logic

Fixture Format

Each scenario directory contains:

eval/fixtures/my_scenario/
├── investigation.json    # Captured Investigation object (SSOT)
├── scenario.yaml        # Metadata + expected outcomes + scoring config
└── README.md           # Human-readable documentation

scenario.yaml Structure

name: "Human-readable scenario name"
family: "pod_not_healthy"
failure_type: "image_pull"
description: "Detailed description of failure"

captured_at: "2026-02-18T10:05:00Z"
cluster: "prod-cluster"

expected_outcomes:
  # What root cause should be identified?
  root_cause:
    patterns: ["ImagePullBackOff", "image.*not found"]
    match_type: "regex"

  # What failure mode should be classified?
  failure_mode:
    exact: "image_pull"

  # What should the proposed fix include?
  proposed_fix:
    all_of: [...]  # Must have ALL
    any_of: [...]  # Must have AT LEAST ONE

  # What should hypotheses mention?
  hypotheses:
    any_of: ["image not found", "authentication"]

  # What should next steps include?
  next_steps:
    command_types: ["kubectl", "aws"]
    must_include: ["describe pod"]

scoring:
  root_cause_weight: 0.4
  fix_accuracy_weight: 0.3
  hypothesis_quality_weight: 0.2
  next_steps_weight: 0.1
  pass_threshold: 70

test_config:
  time_window: "1h"
  enable_llm: false
  enable_diagnostics: true

investigation.json Structure

This is the complete Investigation object serialized to JSON:

{
  "alert": {...},
  "time_window": {...},
  "target": {...},
  "evidence": {
    "k8s": {...},
    "metrics": {...},
    "logs": {...}
  },
  "analysis": {
    "decision": {...},
    "hypotheses": [...],
    "rca": {...}
  }
}
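A quick way to check that a captured fixture is complete is to look for the sections listed above. This is a sketch under the assumption that an empty or absent evidence section indicates an incomplete capture; `missing_sections` is a hypothetical helper and the key lists simply mirror the outline shown here.

```python
EXPECTED_TOP_LEVEL = ("alert", "time_window", "target", "evidence", "analysis")
EXPECTED_EVIDENCE = ("k8s", "metrics", "logs")


def missing_sections(investigation: dict) -> list[str]:
    """Return the names of expected sections that are absent or empty."""
    missing = [key for key in EXPECTED_TOP_LEVEL if key not in investigation]
    evidence = investigation.get("evidence") or {}
    # Treat an empty evidence sub-section the same as a missing one.
    missing += [f"evidence.{key}" for key in EXPECTED_EVIDENCE if not evidence.get(key)]
    return missing
```

An empty return value means every section in the outline is present; anything else names what to re-capture.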

Scoring System

RCA quality is scored on four components:

1. Root Cause Identification (40% default)

What it checks: Does the agent correctly identify the failure?

Scoring logic:

100 pts: Found in RCA root_cause field
90 pts: Mentioned in the base decision's "why" bullets
80 pts: Found in high-confidence hypothesis (≥80%)
0 pts: Not found

Pass threshold: ≥70
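The tiers above can be sketched as a first-match-wins cascade. This is an illustration, not the code in eval/scoring/scorer.py: the argument shapes are assumptions (hypotheses as dicts with hypothetical "text" and "confidence" keys, confidence on a 0 to 1 scale), and patterns are treated as regexes.

```python
import re


def score_root_cause(patterns, rca_root_cause, why_bullets, hypotheses):
    """Return 100, 90, 80, or 0 depending on where the root cause is found."""

    def hit(text):
        # Case-insensitive regex search; any one pattern is enough (assumption).
        return any(
            re.search(p, text or "", re.IGNORECASE | re.DOTALL) for p in patterns
        )

    if hit(rca_root_cause):
        return 100
    if any(hit(bullet) for bullet in why_bullets):
        return 90
    if any(h["confidence"] >= 0.8 and hit(h["text"]) for h in hypotheses):
        return 80
    return 0
```

Checking the sources in descending order of value means a scenario always gets the highest tier it qualifies for.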

2. Fix Accuracy (30% default)

What it checks: Would the proposed fix actually work?

Scoring logic:

100 pts: All required elements present (all_of) AND at least one optional element (any_of)
50 pts: Some optional elements present but missing required elements
30 pts: Has fix content but missing key elements
0 pts: No fix proposed

Pass threshold: ≥60
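One way to read these tiers is sketched below. The case where every required element is present but no optional one matches is not covered by the list above; this sketch assigns it 30 points, which is an assumption. For brevity it also treats all_of and any_of as flat lists of case-insensitive substrings rather than the pattern groups used in scenario.yaml. `score_fix` is a hypothetical name.

```python
def score_fix(fix_text: str, all_of: list[str], any_of: list[str]) -> int:
    """Score a proposed fix as 100, 50, 30, or 0 (illustrative reading of the tiers)."""
    text = (fix_text or "").lower()
    if not text.strip():
        return 0  # no fix proposed
    has_all = all(p.lower() in text for p in all_of)
    has_any = any(p.lower() in text for p in any_of)
    if has_all and has_any:
        return 100
    if has_any:
        return 50  # optional element found, required elements missing
    return 30  # fix content exists but key elements are missing
```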

3. Hypothesis Quality (20% default)

What it checks: Are hypotheses relevant to the actual failure?

Scoring logic:

100 pts: At least one hypothesis mentions expected failure mode
40 pts: Hypotheses exist but don't mention expected failure mode
0 pts: No hypotheses

Pass threshold: ≥50

4. Next Steps (10% default)

What it checks: Are next steps actionable?

Scoring logic:

100 pts: All expected command types present
Proportional: (N / M) × 100, where N is the number of expected command types found and M is the number of expected command types
50 pts: Missing must-include patterns

Pass threshold: ≥50
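The proportional rule is easy to express in code. The sketch below makes two assumptions that the list above does not confirm: a step's command type is its first word, and a missing must_include pattern caps the score at 50 rather than setting it to exactly 50. `score_next_steps` is a hypothetical name.

```python
def score_next_steps(steps, command_types, must_include):
    """Score next steps from 0 to 100 (illustrative reading of the rules)."""
    lowered = [step.strip().lower() for step in steps]
    found = [
        ctype for ctype in command_types
        if any(step.startswith(ctype.lower()) for step in lowered)
    ]
    # N / M * 100, where N = found command types and M = expected command types.
    score = 100.0 * len(found) / len(command_types) if command_types else 100.0
    joined = "\n".join(lowered)
    if any(pattern.lower() not in joined for pattern in must_include):
        score = min(score, 50.0)  # assumption: missing must-include caps the score
    return score
```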

Total Score

Total = (root_cause × 0.4) + (fix_accuracy × 0.3) + (hypothesis_quality × 0.2) + (next_steps × 0.1)

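Test passes if: Total ≥ pass_threshold (default: 70). The weights in the formula are the defaults; a scenario can set its own in the scoring block of scenario.yaml.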
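A worked example of the formula, using the default weights and made-up component scores. The function and dictionary names are illustrative only.

```python
DEFAULT_WEIGHTS = {
    "root_cause": 0.4,
    "fix_accuracy": 0.3,
    "hypothesis_quality": 0.2,
    "next_steps": 0.1,
}


def total_score(components: dict, weights: dict = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the four component scores (each on a 0-100 scale)."""
    return sum(components[name] * weight for name, weight in weights.items())


# Hypothetical component scores for one scenario.
components = {
    "root_cause": 90,          # found in the "why" bullets
    "fix_accuracy": 50,        # optional element only
    "hypothesis_quality": 100,
    "next_steps": 50,
}
total = total_score(components)  # 36 + 15 + 20 + 5
passed = total >= 70
print(f"total={total:.1f} passed={passed}")
```

Here the total is 76, so the scenario passes against the default threshold of 70 even though the root cause was not found in the RCA field itself.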

Pattern Matching

Four match types available:

1. exact

patterns: ["ImagePullBackOff"]
match_type: "exact"

Text must exactly match pattern (case-insensitive).

2. substring

patterns: ["kubectl describe pod"]
match_type: "substring"

Text must contain pattern (case-insensitive).

3. regex

patterns: ["image.*not found", "registry.*(unavailable|unreachable)"]
match_type: "regex"

Text must match regex pattern (case-insensitive, DOTALL).

4. semantic

patterns: ["container startup failure"]
match_type: "semantic"

Not yet implemented: this match type currently falls back to substring matching. Embedding-based semantic similarity is planned.
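The four match types can be sketched as one dispatch function. This is not the code in eval/scoring/matchers.py; it assumes that a single matching pattern is enough and that regex matching means a search anywhere in the text rather than a full match.

```python
import re


def matches(text: str, patterns: list[str], match_type: str = "substring") -> bool:
    """Return True if any pattern matches the text under the given match type."""
    haystack = text.lower()
    for pattern in patterns:
        needle = pattern.lower()
        if match_type == "exact":
            ok = haystack == needle
        elif match_type in ("substring", "semantic"):
            # "semantic" falls back to substring matching for now.
            ok = needle in haystack
        elif match_type == "regex":
            ok = re.search(pattern, text, re.IGNORECASE | re.DOTALL) is not None
        else:
            raise ValueError(f"unknown match_type: {match_type!r}")
        if ok:
            return True
    return False
```

With DOTALL set, a pattern such as "image.*not found" also matches when the two halves sit on different lines of a multi-line message.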

CI Integration

GitHub Actions

Add to .github/workflows/eval.yml:

name: RCA Evaluation Tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install poetry
          poetry install

      - name: Run evaluation tests
        run: |
          poetry run pytest eval/runner.py \
            --junitxml=eval_results.xml \
            --html=eval_report.html \
            --self-contained-html

      - name: Upload results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: |
            eval_results.xml
            eval_report.html

      - name: Comment PR with results
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v6
        with:
          script: |
            // Parse eval_results.xml and post summary comment
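The comment step above is left as a stub. As a rough illustration of what "parse eval_results.xml" could involve, the following Python sketch reads the counters that pytest writes on each testsuite element of its JUnit XML output. It is a standalone helper with a hypothetical name, not part of the workflow, and the summary wording is made up.

```python
import xml.etree.ElementTree as ET


def summarize_junit(xml_text: str) -> str:
    """Build a one-line summary from pytest's JUnit XML report."""
    root = ET.fromstring(xml_text)
    # pytest wraps suites in <testsuites>; accept a bare <testsuite> as well.
    suites = [root] if root.tag == "testsuite" else root.findall("testsuite")
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for suite in suites:
        for key in totals:
            totals[key] += int(suite.get(key, 0))
    passed = (
        totals["tests"] - totals["failures"] - totals["errors"] - totals["skipped"]
    )
    return (
        f"{passed}/{totals['tests']} scenarios passed "
        f"({totals['failures']} failed, {totals['errors']} errors, "
        f"{totals['skipped']} skipped)"
    )
```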

Best Practices

Capturing Fixtures

Capture from real incidents: Don't synthesize; capture actual failures
Wait for evidence: Let investigation complete before capturing
Document context: Fill in README.md with failure details
Verify completeness: Check that investigation.json has all expected evidence

Defining Expected Outcomes

Start broad: Use regex patterns for flexibility
Focus on outcomes: "What should be found?" not "How should it be found?"
Be realistic: Agent won't always get 100%; set thresholds appropriately
Test incrementally: Run test after defining each outcome to verify scoring

Organizing Scenarios

eval/fixtures/
├── job_failure_imagepullbackoff/
├── job_failure_oom/
├── pod_not_healthy_crashloop/
├── pod_not_healthy_liveness/
├── cpu_throttling_high/
└── http_5xx_spike/

Group by alert family and failure type. Use descriptive directory names.
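With one directory per scenario, finding scenarios reduces to listing the subdirectories that contain both fixture files. How eval/runner.py actually discovers scenarios is not described here; this is a sketch with a hypothetical function name.

```python
from pathlib import Path


def discover_scenarios(fixtures_root: str) -> list[str]:
    """Return sorted names of directories holding a complete fixture."""
    root = Path(fixtures_root)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir()
        and (entry / "investigation.json").is_file()
        and (entry / "scenario.yaml").is_file()
    )
```

A directory missing either file is skipped, which keeps half-captured fixtures out of the test run.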

Troubleshooting

Test fails with "Eval test tried to make live cluster call!"

This means the replay mechanism is broken. Check:

Is investigation.json complete?
Is evidence collection being triggered during replay?
Does eval/tools/replay.py call diagnostics with do_collect=False?
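For context, an error like this typically comes from a guard that replaces provider methods during replay so that any network-bound call fails loudly. The framework's own guard is not shown here; the sketch below illustrates the idea with unittest.mock, and `PrometheusClient`, `forbid_live_call`, and `run_guarded` are all hypothetical names.

```python
from unittest import mock


class PrometheusClient:
    """Stand-in for a real provider client (hypothetical)."""

    def query(self, promql: str) -> list:
        raise NotImplementedError("a real client would hit the network here")


def forbid_live_call(*args, **kwargs):
    raise RuntimeError("Eval test tried to make live cluster call!")


def run_guarded(fn):
    """Run fn with the provider method patched to fail on any call."""
    with mock.patch.object(PrometheusClient, "query", side_effect=forbid_live_call):
        return fn()
```

If replay code reaches `PrometheusClient().query(...)` inside `run_guarded`, the RuntimeError points straight at the code path that is still collecting evidence.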

Score is too low

Run with verbose output: poetry run pytest eval/runner.py::test_scenario -vv
Check "Diagnostic Information" section in output
Compare agent's output to expected outcomes
Adjust expected outcomes or improve agent

Fixture capture fails

Check that alert is still firing
Verify cluster access (kubectl, Prometheus)
Check that time window is appropriate
Review agent logs for errors

Development

Adding New Scenarios

# 1. Capture fixture
poetry run python -m eval.tools.capture --interactive --alert-index 0 --output eval/fixtures/new_scenario

# 2. Edit scenario.yaml
vim eval/fixtures/new_scenario/scenario.yaml

# 3. Document in README.md
vim eval/fixtures/new_scenario/README.md

# 4. Run test
poetry run pytest eval/runner.py::test_new_scenario -v

# 5. Iterate until passing

Extending Scoring Logic

Edit eval/scoring/scorer.py to add new scoring components or modify existing logic.

Adding New Match Types

Edit eval/scoring/matchers.py to add new pattern matching strategies (e.g., semantic similarity with embeddings).

Future Enhancements

Semantic matching with embeddings
Diff tool to compare expected vs actual
Trend tracking: score changes over time
LLM-as-judge for subjective quality assessment
Automated fixture capture from production alerts
Regression detection: alert if scores drop
