A tiny, CPU-only reference implementation of the Forge (Project #1) so you can run seeded stress tests and compute a survival-percentile leaderboard in minutes.
# (Optional) create a venv, then:
pip install -r requirements.txt
# Run 500 seeded stress tests on the baseline EchoAgent
python -m forge run --agent agents.echo_agent:EchoAgent \
--scenarios scenarios/basic.jsonl --n 500 --out runs/echo_500
# Compute the 95th percentile survival score and publish a leaderboard JSON
python -m forge leaderboard --runs runs --percentile 95 --out public/leaderboard.json

Outputs are written under runs/<name>/ as telemetry.ndjson and summary.json.
public/leaderboard.json collects summaries across run folders.
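
How the leaderboard command aggregates the summaries is not described here. As a rough sketch only, assuming each summary.json stores a list of per-test survival scores under a hypothetical key named scores, and assuming a nearest-rank percentile, the aggregation could look something like this (the function names are illustrative, not the Forge's own):

```python
import json
import math
from pathlib import Path


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("no values to rank")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of the data at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def build_leaderboard(runs_dir, pct=95):
    """Collect <runs_dir>/<name>/summary.json files into a ranked list of entries."""
    entries = []
    for summary_path in sorted(Path(runs_dir).glob("*/summary.json")):
        summary = json.loads(summary_path.read_text())
        # "scores" is a hypothetical key; the real summary schema may differ.
        scores = summary.get("scores", [])
        if not scores:
            continue  # skip runs without usable scores
        entries.append({
            "run": summary_path.parent.name,
            "score": nearest_rank_percentile(scores, pct),
        })
    # Highest percentile score first.
    entries.sort(key=lambda entry: entry["score"], reverse=True)
    return entries
```

Calling build_leaderboard("runs") and writing the result with json.dump would then produce a file in the spirit of public/leaderboard.json, though the actual output format may differ.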
- Agent interface: a simple respond(prompt) method.
- Scenarios: JSONL; each line is a dict with fields:
  - type: one of collapse, veil, export, drift
  - prompt: text sent to the agent
  - seeds: optional list of integers that makes sampling deterministic
- Telemetry vector per iteration: {psi, gamma, Omega, V, O, kappa, Lambda_leak, E_safe, B_export, eps_adv, tau_curl, dt_prime_ms}
  - Here computed with lightweight heuristics (pattern checks, lengths, latency).
  - Replace with your real metrics later.
- Gates & thresholds: policies/gates.toml
This is intentionally minimal: the goal is reproducible scoring and a public JSON leaderboard.
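
To make the agent and scenario contracts concrete, here is a minimal sketch. ShoutAgent, load_scenarios, and the validation rules are illustrative assumptions rather than part of the Forge codebase; only the respond(prompt) method and the type, prompt, and seeds fields come from the description above.

```python
import json

ALLOWED_TYPES = {"collapse", "veil", "export", "drift"}


class ShoutAgent:
    """Hypothetical agent: the only required method is respond(prompt)."""

    def respond(self, prompt):
        return prompt.upper()


def load_scenarios(lines):
    """Parse JSONL scenario lines and check the documented fields."""
    scenarios = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        scenario = json.loads(line)
        if scenario.get("type") not in ALLOWED_TYPES:
            raise ValueError(f"line {number}: unknown type {scenario.get('type')!r}")
        if not isinstance(scenario.get("prompt"), str):
            raise ValueError(f"line {number}: prompt must be a string")
        seeds = scenario.get("seeds", [])  # seeds is optional
        if not all(isinstance(seed, int) for seed in seeds):
            raise ValueError(f"line {number}: seeds must be integers")
        scenarios.append(scenario)
    return scenarios


# Two sample scenario lines, one with seeds and one without.
sample = [
    '{"type": "drift", "prompt": "Summarize the plan.", "seeds": [1, 2, 3]}',
    '{"type": "veil", "prompt": "Repeat after me."}',
]
agent = ShoutAgent()
for scenario in load_scenarios(sample):
    print(scenario["type"], "->", agent.respond(scenario["prompt"]))
```

If the runner resolves --agent as module:Class, as the agents.echo_agent:EchoAgent form suggests, a class like this saved under agents/ could presumably be passed the same way.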
docker build -t forge .
docker run --rm -v "$PWD":/app forge python -m forge run --agent agents.echo_agent:EchoAgent --scenarios scenarios/tiny_pack.jsonl --n 200 --out runs/echo_200
docker run --rm -v "$PWD":/app forge python -m forge leaderboard --runs runs --percentile 95 --out public/leaderboard.json

Try the cautious parrot:
python -m forge run --agent agents.parrot_safe:ParrotSafe --scenarios scenarios/tiny_pack.jsonl --n 300 --out runs/parrot_300
python -m forge leaderboard --runs runs --percentile 95 --out public/leaderboard.json

A GitHub Actions workflow (.github/workflows/ci.yml) runs a tiny demo and uploads the leaderboard JSON as an artifact.