# Kraken

Kraken is a repository-level RL environment for training LLM coding agents on performance optimization. Each task provides an agent with a full codebase snapshot, a targeted performance workload to speed up, and a set of correctness tests that must remain green. The agent receives a reward based on the Harmonic Speedup Ratio (HSR), jointly reflecting correctness and runtime efficiency.
- Overview
- Installation
- Quick Start
- CLI Reference
- Agent Integration
- Pipeline
- Project Structure
- Contributing
- License
## Overview

Kraken frames pass-to-pass performance engineering as an RL problem: start from a codebase and a slow workload, improve runtime, and don't break behavior. The focus is on investigation (profiling/localization) and correctness-preserving edits, mirroring how performance engineers work day-to-day.
Unlike traditional coding tasks that measure only functional correctness, Kraken jointly rewards correctness and efficiency. An agent that breaks tests receives zero reward, regardless of speed.
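The exact scoring logic lives in `swefficiency/report.py`. As a minimal sketch, assuming HSR is the harmonic mean of per-workload speedup ratios gated on test success (the function name and timings below are illustrative, not the shipped implementation):

```python
from statistics import harmonic_mean

def harmonic_speedup_ratio(baseline_secs, patched_secs, tests_pass):
    """Illustrative reward sketch: harmonic mean of per-workload
    speedups, zeroed on any test failure."""
    if not tests_pass:
        return 0.0  # breaking correctness forfeits all reward
    speedups = [b / p for b, p in zip(baseline_secs, patched_secs)]
    return harmonic_mean(speedups)

# Two workloads sped up 2x and 4x: the harmonic mean (~2.67) weights
# the weaker workload more heavily than the arithmetic mean (3.0) would.
print(harmonic_speedup_ratio([10.0, 8.0], [5.0, 2.0], tests_pass=True))
```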
- Performance-Aware Reward — Jointly scores functional correctness and runtime speedup via HSR. Patches that break tests score zero.
- Docker-Isolated Environment — Every task runs in a prebuilt container with CPU/memory pinning, keeping training runs reproducible.
- Flexible Agent Support — Works with OpenHands, SWE-agent, Cursor CLI, or any agent that produces git patches.
- Full Pipeline Automation — `run_pipeline.sh` handles the entire workflow from data collection to reward computation.
- Rich Analysis Toolkit — Scripts for flamegraph profiling, workload analysis, difficulty classification, and model comparison.
- Extensible — Add new repositories via an auto-detect pipeline with version discovery and Docker image building.
## Installation

Requires Python 3.8+. A Linux host is recommended.
```bash
git clone https://github.com/Ethara-Ai/kraken.git
cd kraken

# Using uv (recommended)
uv venv --python 3.12
source .venv/bin/activate
uv sync

# Or using pip
pip install -e .
```

## Quick Start

### 1. Gold baseline

Establishes reference performance using expert (human) patches:
```bash
swefficiency eval --run_id my_run --num_workers 12
```

Results are stored in `logs/run_evaluation/my_run/gold/`.
### 2. Evaluate agent predictions

```bash
swefficiency eval --run_id my_run --num_workers 12 --prediction_path agent_predictions.jsonl
```

Prediction format (JSONL):

```json
{"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<agent_name>"}
```

Results are stored in `logs/run_evaluation/my_run/<agent_name>/`.
### 3. Generate the report

```bash
swefficiency report \
    --gold_run logs/run_evaluation/my_run/gold \
    --pred_run logs/run_evaluation/my_run/<agent_name>
```

Outputs `eval_reports/eval_report_<agent_name>.csv` (per-instance results) and a matching `.json` file (summary metrics, including HSR).
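The JSON summary can be read programmatically; since the metric names are not documented here, the sketch below simply prints whatever keys the report contains (the file name is a placeholder):

```python
import json

# Print every summary metric rather than assuming specific key names.
with open("eval_reports/eval_report_my_agent.json") as f:
    summary = json.load(f)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```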
### Reproducible training runs

For reproducible training runs, use a dedicated machine (a GCP n2-standard-64 is recommended) with Docker CPU pinning:

```bash
bash scripts/vm/setup_vm.sh
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH
```

Use `--num_workers 12` to give each worker 4 vCPUs and 16 GB of RAM (48 of the machine's 64 vCPUs and 192 of its 256 GB, leaving headroom for the host).
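For intuition, the isolation the harness relies on corresponds to Docker's cpuset and memory limits. A conceptual sketch using the `docker` Python SDK (the image name, command, and pinning values are hypothetical; the real setup is handled by `scripts/vm/setup_docker.sh` and the harness):

```python
import docker

# Pin a container to four dedicated cores and cap its memory so that
# timing measurements are not perturbed by neighboring workers.
client = docker.from_env()
logs = client.containers.run(
    "kraken-task-image:latest",        # hypothetical image name
    command="python run_workload.py",  # hypothetical workload entrypoint
    cpuset_cpus="0-3",                 # 4 dedicated vCPUs
    mem_limit="16g",                   # 16 GB RAM cap
    remove=True,
)
```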
## CLI Reference

### `swefficiency eval`

| Flag | Default | Description |
|---|---|---|
| `--num_workers` | `4` | Parallel workers |
| `--run_id` | auto-generated | Run identifier for output directories |
| `--dataset` | `swefficiency/swefficiency` | HuggingFace dataset or local JSONL path |
| `--prediction_path` | — | Path to agent predictions JSONL (omit for gold baseline) |
| `--instances_regex` | — | Filter instances by regex pattern (e.g. `"numpy.*"`) |
| `--force_rerun` | `false` | Re-run even if cached results exist |
### `swefficiency report`

| Flag | Description |
|---|---|
| `--gold_run` (required) | Path to gold run directory |
| `--pred_run` (required) | Path to agent prediction run directory |
| `--report_output` | Output directory (default: `eval_reports/`) |
| `--num_workers` | Parallel workers (default: 4) |
## Agent Integration

Kraken provides a Docker-based inference harness at `scripts/inference/custom.py` for running agents against environment tasks.
OpenHands:

```bash
python scripts/inference/custom.py \
    --run-id openhands_run \
    --spec scripts/inference/specs/openhands_agent.yaml \
    --num-workers 4 \
    --max-instances 10
```

Cursor CLI:

```bash
python scripts/inference/custom.py \
    --run-id cursor_run \
    --spec scripts/inference/specs/cursor_cli.yaml \
    --num-workers 4 \
    --var cursor_cli_args="--max-steps 75"
```

Each instance produces a git patch at `logs/run_inference/<run_id>/<instance_id>/patch.diff`, ready for reward computation via `swefficiency eval`.
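Before handing a run to `swefficiency eval`, it can be useful to spot instances that failed to produce a patch; a quick sketch (the run id matches the OpenHands example above):

```python
from pathlib import Path

# Instances without a patch.diff cannot be scored, so list them
# before assembling the predictions file.
run_dir = Path("logs/run_inference/openhands_run")
missing = [d.name for d in sorted(run_dir.iterdir())
           if d.is_dir() and not (d / "patch.diff").exists()]
print(f"{len(missing)} instance(s) without a patch: {missing}")
```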
## Pipeline

The `run_pipeline.sh` orchestrator automates the full workflow:

```
Scrape PRs → Filter Performance PRs → Version Detection → Detect Specs
→ Workload Generation → Assemble Dataset → Build Docker → Agent Run
→ Score Patches → Generate Report
```
```bash
# Full pipeline from scratch
./run_pipeline.sh --repo owner/repo --run-id my_run

# Use an existing dataset (skips stages 1-6)
./run_pipeline.sh --dataset artifacts/final/dataset.jsonl --run-id my_run --stages eval,pred_eval,report

# With agent inference
./run_pipeline.sh --dataset artifacts/final/dataset.jsonl --run-id my_run --mode openhands
```
## Project Structure

```
├── pyproject.toml       # Package configuration
├── run_pipeline.sh      # Pipeline orchestrator
├── swefficiency/        # Python package
│   ├── cli.py           # CLI entrypoint
│   ├── report.py        # Report generation (HSR scoring)
│   ├── harness/         # Docker-based environment
│   ├── collect/         # Dataset collection
│   ├── versioning/      # Version detection
│   ├── perf_filter/     # Performance PR filtering
│   └── workload/        # Workload generation
├── scripts/
│   ├── eval/            # Scoring scripts
│   ├── inference/       # Agent harness
│   ├── perf/            # Performance analysis
│   ├── vm/              # Docker/VM setup
│   └── slurm/           # Cluster support
├── analysis/            # Research scripts + plots
├── tests/               # Test suite
└── docs/                # Documentation + figures
```
## Contributing

See CONTRIBUTING.md for guidelines. This codebase began as a fork of SWE-Gym's SWE-Bench fork and extends that pipeline with performance-specific commit filtering, workload evaluation, and additional training tooling.
## License

Copyright 2026 Google LLC

Licensed under the Apache License, Version 2.0. See LICENSE for details.

This is not an officially supported Google product.