A GPU experiment orchestrator for ML research.
Orze runs experiments on GPUs: schedule ideas → train → evaluate → report → repeat. It coordinates GPUs via filesystem locks, works across machines, and gives you a complete leaderboard, notifications, and analysis — out of the box.
Website: orze.ai
curl -sL https://orze.ai/install | bash

That's it. The installer sets up orze, detects your GPUs and codebase, generates training scripts and experiment ideas, and starts running, all in one command.
Pass environment variables for additional options:
# LLM-powered setup
ANTHROPIC_API_KEY=sk-ant-... curl -sL https://orze.ai/install | bash
# With pro (autopilot)
ORZE_PRO_KEY=ORZE-PRO-xxx curl -sL https://orze.ai/install | bash
# Custom project path
curl -sL https://orze.ai/install | bash -s /nfs/my-project

orze is a complete, production-ready tool. orze-pro adds autopilot, so experiments run while you sleep.
| Feature | orze (free) | + orze-pro |
|---|---|---|
| GPU scheduling & multi-node | ✓ | ✓ |
| Idea queue (ideas.md + SQLite) | ✓ | ✓ |
| Hyperparameter sweep (auto-expand grid) | ✓ | ✓ |
| Leaderboard report | ✓ | ✓ |
| Notifications (Telegram/Slack) | ✓ | ✓ |
| Admin dashboard & MCP server | ✓ | ✓ |
| Retrospection (plateau detection) | ✓ | ✓ |
| Cross-experiment regression analysis | ✓ | ✓ |
| Failure analysis & categorization | ✓ | ✓ |
| Checkpoint GC | ✓ | ✓ |
| Sealed eval protection | ✓ | ✓ |
| Service watchdog (auto-restart + containers) | ✓ | ✓ |
| Autonomous research agents (Gemini/GPT/Claude) | | ✓ |
| The Professor (paper lake, cross-domain search, strategy) | | ✓ |
| Engineer (implement ideas, fix bugs) | | ✓ |
| Auto-fix failed experiments | | ✓ |
| Code evolution on plateau | | ✓ |
| Meta-research (strategy adjustment) | | ✓ |
| FSM orchestration (7 procedures) | | ✓ |
| Data analyst & thinker (auto-injected) | | ✓ |

| | orze (free) | + orze-pro |
|---|---|---|
| How ideas are generated | Smart Suggestions — rule-based: detects regressions, generates scale sweeps, perturbations | Research Agents — LLM-driven: reads all results, forms hypotheses, designs novel experiments |
| How failures are handled | You read the failure log | Auto-fix: LLM diagnoses and patches the error |
| How plateaus are handled | Smart Suggestions tries parameter variations | Code Evolution: LLM modifies your train script |
| Does research stop? | Never — Smart Suggestions keeps GPUs busy | Never — agents run indefinitely |
| Requires API key? | No | Yes (Gemini/OpenAI/Anthropic) |

| orze | orze-pro | Notes |
|---|---|---|
| 4.1.x | 0.8.x | Current release |
After install, orze auto-detects GPUs and starts running experiments.
AI CLI users (Claude Code, Cursor, Codex):
do @ORZE-AGENT.md

# Project lifecycle
orze init [path] # initialize a new project
orze start # start as background daemon
orze stop # stop gracefully
orze restart # stop + start
orze --check # validate config, files, GPUs, API keys
orze --uninstall # full cleanup, preserves research results
# Operations
orze upgrade # reinstall from source + restart daemon
orze admin migrate # migrate legacy layout to .orze/
orze service install # auto-restart on crash (systemd)
# Pro
orze pro activate <key> # activate license
orze pro status # check license info
orze pro deactivate # remove license
orze sop list # list available SOPs

your-project/
├── orze.yaml # Project config (single source of truth)
├── train.py # Your training script
├── ideas.md # Experiment queue
├── GOAL.md # Research objective
├── RESEARCH_RULES.md # Agent constraints
├── configs/base.yaml # Default hyperparameters
├── .env # API keys (gitignored)
├── ORZE-AGENT.md # AI CLI instructions
├── ORZE-RULES.md # Agent guardrails
├── venv/ # Training dependencies
├── .orze/ # Runtime state (gitignored)
│   ├── state/version.json # Layout version
│   ├── logs/ # Role logs
│   ├── locks/ # Filesystem locks
│   ├── rules/ # Migrated rule files
│   ├── mcp/ # MCP server configs
│   ├── receipts/ # Execution evidence
│   ├── triggers/ # One-shot role triggers
│   ├── heartbeats/ # Per-host liveness
│   ├── backups/ # ideas.md backups
│   └── feedback/ # Failure feedback
├── procedures/ # User procedure overrides (pro)
├── fsm/runner.py # FSM orchestrator (pro)
└── orze_results/ # Research outputs
    ├── idea-0001/metrics.json
    ├── methods/ # Generated code
    └── knowledge/ # Analysis insights
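
The layout above includes a `.orze/locks/` directory, and orze is described as coordinating GPUs through filesystem locks. As a rough illustration of that general technique (not orze's actual implementation), the sketch below claims a GPU by atomically creating a lock file. The function names and the `gpu-N.lock` naming scheme are hypothetical, and whether exclusive create is atomic on a shared filesystem depends on the NFS version and mount options.

```python
import os
from pathlib import Path


def try_claim_gpu(lock_dir: Path, gpu_index: int, owner: str) -> bool:
    """Try to claim a GPU by atomically creating a lock file.

    Returns True if this caller created the lock, False if it already existed.
    """
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"gpu-{gpu_index}.lock"  # hypothetical naming scheme
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one claimant wins.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(owner)  # record who holds the lock, for debugging
    return True


def release_gpu(lock_dir: Path, gpu_index: int) -> None:
    """Release a claim by removing the lock file (no-op if it is already gone)."""
    (lock_dir / f"gpu-{gpu_index}.lock").unlink(missing_ok=True)
```

A real scheduler would also need to handle stale locks left behind by crashed processes, for example by pairing each lock with a liveness signal such as a heartbeat.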
Start orze in the same shared folder on any machine — nodes auto-join the research pool.
# Node 1
ssh node1 "cd /nfs/project && orze start"
# Node 2
ssh node2 "cd /nfs/project && orze start"

- Scales to 1M+ Experiments: SQLite-backed job queue with O(log N) scheduling
- Config Inheritance: Child ideas inherit parent configs; specify only what changes
- HP Sweep: `lr: [1e-4, 3e-4]` auto-expands into all combinations
- Failure Protection: Stops automatically when failure rates spike
- Cross-Experiment Analysis: Detects regressions and tradeoffs, and suggests actions
- Rich Notifications: GPU VRAM, per-dataset breakdown, verified results, target/gap tracking
- Admin Panel: Real-time web dashboard at http://localhost:8787
- Clean Uninstall: `orze --uninstall` removes runtime files, preserves results
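
To make the config inheritance and HP sweep items concrete, here is a minimal sketch of how a child idea's overrides could be merged onto a parent config and list-valued entries expanded into a grid. It illustrates the idea only: `expand_sweep` is a hypothetical helper, the merge is shallow, and treating every list value as a sweep axis is an assumption rather than orze's documented behavior.

```python
from itertools import product
from typing import Any


def expand_sweep(parent: dict[str, Any], overrides: dict[str, Any]) -> list[dict[str, Any]]:
    """Merge child overrides onto a parent config, then expand list values into a grid."""
    merged = {**parent, **overrides}  # child values win over inherited ones
    sweep_keys = [k for k, v in merged.items() if isinstance(v, list)]
    if not sweep_keys:
        return [merged]
    configs = []
    # Cartesian product over every list-valued key yields one config per combination.
    for combo in product(*(merged[k] for k in sweep_keys)):
        cfg = dict(merged)
        cfg.update(zip(sweep_keys, combo))
        configs.append(cfg)
    return configs


base = {"lr": 1e-3, "batch_size": 32, "epochs": 10}
child = {"lr": [1e-4, 3e-4], "batch_size": [32, 64]}
for cfg in expand_sweep(base, child):
    print(cfg)
```

With two values for `lr` and two for `batch_size`, this produces four configs, each inheriting `epochs` from the parent.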
Your training script receives:
python train.py --idea-id idea-001 --results-dir orze_results --ideas-md ideas.md --config base.yaml

Required output: orze_results/{idea_id}/metrics.json:

{"status": "COMPLETED", "test_accuracy": 0.92, "training_time": 142.5}

See SKILL.md for the full technical specification.
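
As an illustration of the contract above, a training script could be structured roughly as follows. This is a minimal sketch: it accepts the four flags shown in the command line and writes `metrics.json` with the three fields from the example. The training step is a placeholder, and which fields are mandatory is an assumption here; SKILL.md is the authoritative reference.

```python
import argparse
import json
import time
from pathlib import Path


def parse_args(argv=None):
    """Parse the flags shown in the command line above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--idea-id", required=True)
    parser.add_argument("--results-dir", required=True)
    parser.add_argument("--ideas-md", required=True)
    parser.add_argument("--config", required=True)
    return parser.parse_args(argv)


def run(args) -> Path:
    """Run a (placeholder) experiment and write metrics.json for this idea."""
    start = time.time()
    test_accuracy = 0.0  # placeholder: replace with real training and evaluation

    out_dir = Path(args.results_dir) / args.idea_id
    out_dir.mkdir(parents=True, exist_ok=True)
    metrics = {
        "status": "COMPLETED",
        "test_accuracy": test_accuracy,
        "training_time": round(time.time() - start, 2),
    }
    metrics_path = out_dir / "metrics.json"
    metrics_path.write_text(json.dumps(metrics))
    return metrics_path

# In a real train.py, finish with:
#   if __name__ == "__main__":
#       run(parse_args())
```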
The admin panel auto-launches at http://localhost:8787. No extra install is needed.
notifications:
  enabled: true
  on: [completed, failed, new_best]
  channels:
    - type: telegram
      bot_token: "YOUR_BOT_TOKEN"
      chat_id: "YOUR_CHAT_ID"

orze service install -c orze.yaml # auto-restart on crash + manage containers
orze service status # check health
orze service uninstall # remove

The watchdog runs every minute (crontab) or every 5 minutes (systemd). It restarts orze after a crash or stall and manages Docker containers defined in orze.yaml:
containers:
  paperdog:
    image: orzeai/paperdog:latest
    ports:
      - "8000:8000"

Containers are auto-pulled and recreated when a new image is available.
@article{li2026autoresearching,
title={Auto Researching, not hyperparameter tuning: Convergence Analysis of 10,000 Experiments},
author={Li, Xiaoyi},
journal={arXiv preprint arXiv:2603.15916},
year={2026}
}

Apache 2.0: orze is and will always be free and open source.
orze-pro (autopilot features) is commercially licensed.