A layered testing system for measuring Cline's performance at different levels.
Note: Smoke tests (Layer 2) are partially disabled while the eval framework is repointed at the new SDK CLI. The scenarios under `evals/smoke-tests/` are preserved, and `npm run eval:smoke:run` still works against whatever `cline` is on `$PATH` (install with `npm i -g cline`). The old build-and-link helpers and the auto-running `cline-evals-regression.yml` workflow are off until someone wires the build step to the new SDK CLI.
```
evals/
├── smoke-tests/          # Quick provider validation (minutes)
│   ├── run-smoke-tests.ts
│   └── scenarios/        # 5 curated test scenarios
│
├── e2e/                  # Full E2E with cline-bench (hours)
│   └── run-cline-bench.ts
│
├── cline-bench/          # Real-world tasks (git submodule)
│   └── tasks/            # 12 production bug fixes
│
├── analysis/             # Metrics and reporting framework
│   ├── src/
│   │   ├── metrics.ts    # pass@k, pass^k calculations
│   │   ├── classifier.ts # Failure pattern matching
│   │   └── reporters/    # Markdown, JSON output
│   └── patterns/
│       └── cline-failures.yaml
│
└── baselines/            # Performance baselines for regression detection
```
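The classifier matches run transcripts against the patterns in `cline-failures.yaml`. As a purely illustrative sketch (the field names below are hypothetical, not the real schema; see the actual file for the format the classifier expects), an entry might look like:

```yaml
# Hypothetical failure-pattern entry (illustrative field names only)
- id: tool-call-parse-error
  description: Model emitted a tool call the parser could not read
  match:
    regex: "Unable to parse tool call"
  severity: high
```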
Location: `src/core/api/transform/__tests__/`

Tests API transform logic without LLM calls:
- Thinking trace preservation
- Tool call parsing (XML, native formats)
- Provider format conversions

```bash
npm run test:unit -- --grep "Thinking\|Tool Call"
```

Location: `evals/smoke-tests/`
Quick validation across providers with real LLM calls:
- 5 curated scenarios
- 3 trials per test for pass@k metrics
- Runs the `cline` CLI with `--config`, `-y`, `-t`, and `-m`
```bash
# Set API key (Cline provider)
export CLINE_API_KEY=sk-...

# Run smoke tests
npm run eval:smoke:run

# Run specific scenario
npm run eval:smoke:run -- --scenario 01-create-file

# Run with specific model (overrides per-scenario models)
npm run eval:smoke:run -- --model anthropic/claude-sonnet-4.5
```

Location: `evals/e2e/` + `evals/cline-bench/`
Full agent tests on production-grade tasks:
- 12 real-world coding problems
- Docker/Daytona execution via Harbor
- Nightly CI runs
```bash
# Prerequisites: Python 3.13, Harbor, Docker
npm run eval:e2e

# Specific task
npm run eval:e2e -- --tasks discord

# Different provider
npm run eval:e2e -- --provider openai --model gpt-4o
```

The framework calculates:
| Metric | Formula | Interpretation |
|---|---|---|
| pass@k | P(≥1 of k passes) | Solution finding capability |
| pass^k | P(all k pass) | Reliability |
| Flakiness | Entropy of pass rate | Consistency |
With 3 trials:
- All pass → `pass` (reliable)
- All fail → `fail` (broken)
- Mixed → `flaky` (needs investigation)
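The actual calculations live in `evals/analysis/src/metrics.ts`. As a rough sketch (one common way to compute these, not necessarily the exact implementation there): pass@k via the standard combinatorial estimator `1 - C(n-c, k) / C(n, k)` for `n` trials with `c` passes, pass^k as the chance that `k` sampled trials all pass, and flakiness as the binary entropy of the observed pass rate.

```typescript
// pass@k: probability that at least one of k sampled trials passes,
// computed as 1 - C(n-c, k) / C(n, k) without large factorials.
function passAtK(n: number, c: number, k: number): number {
  if (n - c < k) return 1; // fewer failures than samples: a pass is guaranteed
  let probAllFail = 1;
  for (let i = 0; i < k; i++) probAllFail *= (n - c - i) / (n - i);
  return 1 - probAllFail;
}

// pass^k: probability that all k sampled trials pass.
function passHatK(n: number, c: number, k: number): number {
  let probAllPass = 1;
  for (let i = 0; i < k; i++) probAllPass *= Math.max(c - i, 0) / (n - i);
  return probAllPass;
}

// Flakiness as binary entropy of the pass rate: 0 when all trials
// agree (all pass or all fail), 1 at a 50/50 split.
function flakiness(n: number, c: number): number {
  const p = c / n;
  if (p === 0 || p === 1) return 0;
  return -(p * Math.log2(p) + (1 - p) * Math.log2(1 - p));
}
```

With 3 trials and 2 passes, `passAtK(3, 2, 1)` is 2/3 while `passHatK(3, 2, 3)` is 0, which is why the framework reports both: one rewards finding a solution at all, the other rewards doing it every time.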
- Current PR gate: contract tests only
- Smoke test CI: temporarily disabled while the workflow is repointed at the SDK CLI
- Nightly: E2E tests with cline-bench are not yet implemented; see the TODO list below
```bash
# Run all fast tests
npm run test:unit
npm run eval:smoke:run

# Run E2E (requires setup)
cd evals/cline-bench
# Follow README.md for Harbor setup
npm run eval:e2e
```

To add a smoke-test scenario:
- Create `evals/smoke-tests/scenarios/<name>/config.json`
- Add an optional `template/` directory with starting files
- Run to verify: `npm run eval:smoke:run -- --scenario <name>`
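The `config.json` schema is defined by the smoke-test runner; check the existing scenarios under `evals/smoke-tests/scenarios/` for the real format. A purely hypothetical entry (field names are illustrative assumptions, not the actual schema) might look like:

```json
{
  "name": "01-create-file",
  "task": "Create a file named hello.txt containing 'hello world'",
  "model": "anthropic/claude-sonnet-4.5",
  "trials": 3
}
```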
To add a contract test:
- Add to `src/core/api/transform/__tests__/`
- Run: `npm run test:unit -- --grep "YourTest"`
Contribute to cline/cline-bench
- Nightly E2E CI: Add scheduled workflow for cline-bench tests
- Requires: Docker runner, Harbor setup, ~1-2 hour timeout
- Should run on schedule (e.g., nightly) not per-PR
- Separate secrets for E2E environment
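A scheduled workflow meeting those requirements could look roughly like the sketch below. The file name, secret name, and cron time are assumptions, not a finished implementation; Harbor and Docker setup steps are elided.

```yaml
# .github/workflows/cline-bench-nightly.yml (hypothetical sketch)
name: cline-bench nightly
on:
  schedule:
    - cron: "0 6 * * *" # nightly, never per-PR
jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 120 # ~1-2 hour budget
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: true # cline-bench is a git submodule
      - run: npm ci
      # Harbor + Docker setup goes here (see evals/cline-bench README)
      - run: npm run eval:e2e
        env:
          CLINE_API_KEY: ${{ secrets.E2E_CLINE_API_KEY }} # separate E2E secret
```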
- Native tool calling smoke tests: Add CLI support for the `native_tool_call_enabled` setting to test Claude 4 with native tools