# Kraken

Kraken is a repository-level RL environment for training LLM coding agents on performance optimization. Each task provides an agent with a full codebase snapshot, a targeted performance workload to speed up, and a set of correctness tests that must remain green. The agent receives a reward based on the Harmonic Speedup Ratio (HSR), jointly reflecting correctness and runtime efficiency.
- Overview
- Installation
- Quick Start
- CLI Reference
- Agent Integration
- Pipeline
- Project Structure
- Contributing
- License
## Overview

Kraken frames pass-to-pass performance engineering as an RL problem: start from a codebase and a slow workload, improve runtime, and don't break behavior. The focus is on investigation (profiling/localization) and correctness-preserving edits, mirroring how performance engineers work day-to-day.
Unlike traditional coding tasks that measure only functional correctness, Kraken jointly rewards correctness and efficiency. An agent that breaks tests receives zero reward, regardless of speed.
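The exact scoring logic lives in `swefficiency/report.py`. As a minimal sketch, assuming HSR is the harmonic mean of per-workload speedup ratios gated on test success (the function name and timings below are illustrative, not the shipped implementation):

```python
from statistics import harmonic_mean

def harmonic_speedup_ratio(baseline_secs, patched_secs, tests_pass):
    """Illustrative reward sketch: harmonic mean of per-workload
    speedups, zeroed on any test failure."""
    if not tests_pass:
        return 0.0  # breaking correctness forfeits all reward
    speedups = [b / p for b, p in zip(baseline_secs, patched_secs)]
    return harmonic_mean(speedups)

# Two workloads sped up 2x and 4x: the harmonic mean (~2.67) weights
# the weaker workload more heavily than the arithmetic mean (3.0) would.
print(harmonic_speedup_ratio([10.0, 8.0], [5.0, 2.0], tests_pass=True))
```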
- Performance-Aware Reward — Jointly scores functional correctness and runtime speedup via HSR. Patches that break tests score zero.
- Docker-Isolated Environment — Every task runs in a prebuilt container with CPU/memory pinning, keeping training runs reproducible.
- Flexible Agent Support — Works with OpenHands, SWE-agent, Cursor CLI, or any agent that produces git patches.
- Full Pipeline Automation — `run_pipeline.sh` handles the entire workflow from data collection to reward computation.
- Rich Analysis Toolkit — Scripts for flamegraph profiling, workload analysis, difficulty classification, and model comparison.
- Extensible — Add new repositories via an auto-detect pipeline with version discovery and Docker image building.
## Installation

Requires Python 3.8+. A Linux host is recommended.
```bash
git clone https://github.com/Ethara-Ai/kraken.git
cd kraken

# Using uv (recommended)
uv venv --python 3.12
source .venv/bin/activate
uv sync

# Or using pip
pip install -e .
```

## Quick Start

### 1. Gold baseline

Establishes reference performance using expert (human) patches:
```bash
swefficiency eval --run_id my_run --num_workers 12
```

Results are stored in `logs/run_evaluation/my_run/gold/`.
### 2. Evaluate agent predictions

```bash
swefficiency eval --run_id my_run --num_workers 12 --prediction_path agent_predictions.jsonl
```

Prediction format (JSONL):

```json
{"instance_id": "<id>", "model_patch": "<patch_text>", "model_name_or_path": "<agent_name>"}
```

Results are stored in `logs/run_evaluation/my_run/<agent_name>/`.
### 3. Generate the report

```bash
swefficiency report \
    --gold_run logs/run_evaluation/my_run/gold \
    --pred_run logs/run_evaluation/my_run/<agent_name>
```

Outputs `eval_reports/eval_report_<agent_name>.csv` (per-instance results) and a matching `.json` file (summary metrics, including HSR).
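The JSON summary can be read programmatically; since the metric names are not documented here, the sketch below simply prints whatever keys the report contains (the file name is a placeholder):

```python
import json

# Print every summary metric rather than assuming specific key names.
with open("eval_reports/eval_report_my_agent.json") as f:
    summary = json.load(f)
for metric, value in summary.items():
    print(f"{metric}: {value}")
```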
### Reproducible training runs

For reproducible training runs, use a dedicated machine (a GCP n2-standard-64 is recommended) with Docker CPU pinning:

```bash
bash scripts/vm/setup_vm.sh
sudo scripts/vm/setup_docker.sh MEM_MAX MEM_HIGH
```

Use `--num_workers 12` to give each worker 4 vCPUs and 16 GB of RAM (48 of the machine's 64 vCPUs and 192 of its 256 GB, leaving headroom for the host).
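For intuition, the isolation the harness relies on corresponds to Docker's cpuset and memory limits. A conceptual sketch using the `docker` Python SDK (the image name, command, and pinning values are hypothetical; the real setup is handled by `scripts/vm/setup_docker.sh` and the harness):

```python
import docker

# Pin a container to four dedicated cores and cap its memory so that
# timing measurements are not perturbed by neighboring workers.
client = docker.from_env()
logs = client.containers.run(
    "kraken-task-image:latest",        # hypothetical image name
    command="python run_workload.py",  # hypothetical workload entrypoint
    cpuset_cpus="0-3",                 # 4 dedicated vCPUs
    mem_limit="16g",                   # 16 GB RAM cap
    remove=True,
)
```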
## CLI Reference

### `swefficiency eval`

| Flag | Default | Description |
|---|---|---|
| `--num_workers` | `4` | Parallel workers |
| `--run_id` | auto-generated | Run identifier for output directories |
| `--dataset` | `swefficiency/swefficiency` | HuggingFace dataset or local JSONL path |
| `--prediction_path` | — | Path to agent predictions JSONL (omit for gold baseline) |
| `--instances_regex` | — | Filter instances by regex pattern (e.g. `"numpy.*"`) |
| `--force_rerun` | `false` | Re-run even if cached results exist |
### `swefficiency report`

| Flag | Description |
|---|---|
| `--gold_run` (required) | Path to gold run directory |
| `--pred_run` (required) | Path to agent prediction run directory |
| `--report_output` | Output directory (default: `eval_reports/`) |
| `--num_workers` | Parallel workers (default: 4) |
## Agent Integration

Kraken provides a Docker-based inference harness at `scripts/inference/custom.py` for running agents against environment tasks.
OpenHands:

```bash
python scripts/inference/custom.py \
    --run-id openhands_run \
    --spec scripts/inference/specs/openhands_agent.yaml \
    --num-workers 4 \
    --max-instances 10
```

Cursor CLI:

```bash
python scripts/inference/custom.py \
    --run-id cursor_run \
    --spec scripts/inference/specs/cursor_cli.yaml \
    --num-workers 4 \
    --var cursor_cli_args="--max-steps 75"
```

Each instance produces a git patch at `logs/run_inference/<run_id>/<instance_id>/patch.diff`, ready for reward computation via `swefficiency eval`.
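Before handing a run to `swefficiency eval`, it can be useful to spot instances that failed to produce a patch; a quick sketch (the run id matches the OpenHands example above):

```python
from pathlib import Path

# Instances without a patch.diff cannot be scored, so list them
# before assembling the predictions file.
run_dir = Path("logs/run_inference/openhands_run")
missing = [d.name for d in sorted(run_dir.iterdir())
           if d.is_dir() and not (d / "patch.diff").exists()]
print(f"{len(missing)} instance(s) without a patch: {missing}")
```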
## Pipeline

The `run_pipeline.sh` orchestrator automates the full workflow:

```
Scrape PRs → Filter Performance PRs → Version Detection → Detect Specs
→ Workload Generation → Assemble Dataset → Build Docker → Agent Run
→ Score Patches → Generate Report
```
```bash
# Full pipeline from scratch
./run_pipeline.sh --repo owner/repo --run-id my_run

# Use an existing dataset (skips stages 1-6)
./run_pipeline.sh --dataset artifacts/final/dataset.jsonl --run-id my_run --stages eval,pred_eval,report

# With agent inference
./run_pipeline.sh --dataset artifacts/final/dataset.jsonl --run-id my_run --mode openhands
```
## Project Structure

```
├── pyproject.toml       # Package configuration
├── run_pipeline.sh      # Pipeline orchestrator
├── swefficiency/        # Python package
│   ├── cli.py           # CLI entrypoint
│   ├── report.py        # Report generation (HSR scoring)
│   ├── harness/         # Docker-based environment
│   ├── collect/         # Dataset collection
│   ├── versioning/      # Version detection
│   ├── perf_filter/     # Performance PR filtering
│   └── workload/        # Workload generation
├── scripts/
│   ├── eval/            # Scoring scripts
│   ├── inference/       # Agent harness
│   ├── perf/            # Performance analysis
│   ├── vm/              # Docker/VM setup
│   └── slurm/           # Cluster support
├── analysis/            # Research scripts + plots
├── tests/               # Test suite
└── docs/                # Documentation + figures
```
## Contributing

See CONTRIBUTING.md for guidelines. This codebase began as a fork of SWE-Gym's SWE-Bench fork and extends that pipeline with performance-specific commit filtering, workload evaluation, and additional training tooling.
## License

Copyright 2026 Google LLC

Licensed under the Apache License, Version 2.0. See LICENSE for details.

This is not an officially supported Google product.