cve-env

Agentic CVE → Docker environment builder.

Given a CVE ID, cve-env builds a Docker environment running the affected application at its pre-patch version and verifies the build is correct. The agent researches each CVE live (NVD + OSV + GitHub + container registries) and picks its own build path.

Version: 1.0 beta
Author: Gadi Evron (@gadievron)

What `success` means

status="success" requires both:

Right version — a version-assertion exec_check (pip show, dpkg -l, openssl version, find / -name '*.jar', …) proving the deployed binaries fall in the CVE's affected range.
Working app — functional-smoke checks proving normal operation on benign input (e.g. HTTP GET / + a content match, or a SELECT 1 roundtrip for a database).

If verify runs but that evidence is incomplete, the runtime records the honest fallback verified_partial instead. cve-env's job ends at a correct, working environment — it builds and verifies; it does not assess or attempt the vulnerability.

Every build runs in a hardened container: --cap-drop ALL, --security-opt no-new-privileges:true, ports bound to 127.0.0.1 only.
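
As a rough illustration of the hardening flags listed above, the following Python sketch assembles an equivalent `docker run` argument list. The function name, the image name `example/app:1.0`, and the port numbers are placeholders for illustration; how cve-env itself assembles the command is not shown in this README.

```python
def hardened_run_argv(image: str, host_port: int, container_port: int) -> list[str]:
    """Build a `docker run` argv that applies the three hardening measures."""
    return [
        "docker", "run", "--detach",
        # Drop every Linux capability.
        "--cap-drop", "ALL",
        # Prevent processes from gaining privileges via setuid binaries and similar.
        "--security-opt", "no-new-privileges:true",
        # Publish on loopback only, so the service is unreachable from other hosts.
        "--publish", f"127.0.0.1:{host_port}:{container_port}",
        image,
    ]


if __name__ == "__main__":
    # Placeholder image and ports, for illustration only.
    print(" ".join(hardened_run_argv("example/app:1.0", 8080, 80)))
```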

Requirements

Python 3.12+ and uv
A Docker daemon reachable via docker on PATH. Developed and tested against Colima; other Docker-compatible daemons (Docker Desktop, OrbStack, rootless Docker) should work but are untested.
The claude CLI, logged in — cve-env authenticates through your existing Claude Code session via claude-agent-sdk. No ANTHROPIC_API_KEY is consumed.

Install

git clone <this-repo> && cd cve-env
uv sync

Quick start

# Build + verify one CVE (prints a per-stage report at the end)
uv run cve-env build CVE-2014-0160

# Probe external services (NVD, OSV, GitHub, Docker Hub, registries)
uv run cve-env doctor

Per-CVE audit traces (every tool call, result, and cost) are written to output/agentic/<run-id>/<cve>.jsonl.
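
The trace schema is not documented in this README, so the sketch below makes no assumption about field names: it only counts records and tallies whichever top-level keys appear. `summarize_trace` is a hypothetical helper, not part of cve-env.

```python
import json
from collections import Counter
from collections.abc import Iterable


def summarize_trace(lines: Iterable[str]) -> tuple[int, Counter]:
    """Count JSONL records and tally the top-level keys they contain."""
    records = 0
    keys: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        records += 1
        if isinstance(record, dict):
            keys.update(record.keys())
    return records, keys
```

To use it, open one trace file from output/agentic/<run-id>/ and pass the file object directly, since iterating a text file yields its lines.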

How it works

The agent picks each tool from the CVE record and prior results — there's no fixed script. Work moves through five stages (the stage of each tool is defined by config.TOOL_TO_STAGE):

Stage	Goal	Tools
RESEARCH	Ground the CVE — product, affected version, source repo	`nvd_lookup` (OSV.dev fallback), `github_fetch`
RESOLVE	Find a pre-built image to pull	`image_resolve` (registry cascade)
ACQUIRE	Build an image when no pre-built one fits	`dockerfile_gen`, `docker_build`, `source_build`
LAUNCH	Start the environment	`docker_run`, `docker_compose_up`, `run_in_container`
VERIFY	Prove it's the right version and works	`verify` (7 check types); `give_up` ends the run

RESOLVE and ACQUIRE are alternatives — pull an existing image, or build one. The decision cascade:

1. Research: establish the product, the affected version, and (for OSS) the source repo.
2. Resolve: try to pull a pre-built image. image_resolve walks a registry cascade (Docker Hub → mirror.gcr.io → quay.io → ghcr.io → mcr.microsoft.com) with transport-class retry and an architecture check. If one fits, go to Launch.
3. Acquire (only when no pre-built image fits): build one, using one of:
   - source_build: clone the OSS repo at the pre-patch tag/commit and build it;
   - dockerfile_gen → docker_build: synthesize a Dockerfile (library / language CVEs);
   - plugin overlay: pull a clean host image (e.g. wordpress:5.6), then copy_ops the vulnerable plugin onto it;
   - forge cascade: fetch code hosted off GitHub (WP-SVN / OSDN / SourceForge).
4. Launch: docker_run (single service) or docker_compose_up (multi-service vulhub stacks).
5. Verify, then self-heal on failure: retry, pivot registry, pivot resolve→build, or give_up with a reason.

The tools in the stage table above are cve-env's 11 custom MCP tools. The agent also uses the Claude Code SDK's built-in tools (Bash, Write, Read, …) for staging files, git clone, and direct shell steps; Bash is in fact its most-used tool. Recovery is handled by the agent, not by an orchestrator: an empty nvd_lookup falls back to OSV.dev; a rate-limited image_resolve moves to the next registry (or a generic base image plus manual install); a source_build with no matching tag builds directly from a git clone in the Dockerfile.

Verification — proving the environment works

VERIFY builds evidence in layers, from "is it up" to "is it the right version doing real work." All HTTP/TCP probes hit the container's published port, bound to 127.0.0.1 only — a non-loopback target is rejected, so the agent can't probe the host network.

Layer	Check type(s)	Proves
Readiness	`container_status`, `stability_wait`, `log_check`	the container started and stayed up; an expected startup / health marker appears in `docker logs`
Networking	`http_check`	the service answers on its port — and the body isn't empty (a zero-byte 200 fails)
Actual usage	`http_request_check` (send a request body, match the response), `tcp_probe_check` (raw TCP send/receive — Redis / Postgres / SMTP / SSH / Memcached …), `exec_check` (run a command inside the container, assert stdout + exit code)	the application does real work on benign input
Version proof	`exec_check` (e.g. `openssl version`, `pip show`, `dpkg -l`, `find / -name '*.jar'`)	the deployed binaries fall in the CVE's affected range
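
The loopback-only rule for probe targets could be enforced with a check along these lines. This is a sketch under assumptions: `is_loopback_target` is a hypothetical name, and it deliberately accepts only IP literals, so a hostname such as `localhost` is rejected rather than resolved.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_target(url: str) -> bool:
    """Return True only when the URL's host is a loopback IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): reject in this sketch.
        return False
```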

Functional smoke is matched to the application type — not just an HTTP ping. A web app gets a page fetch with a content match (and a deliberate 404); a database gets a query roundtrip (SELECT 1, INSERT/SELECT); a cache / wire-protocol service gets a protocol probe (e.g. a Redis PING → +PONG over raw TCP, or redis-cli ping via exec_check); a library gets a trivial-use exec_check. The has_functional_smoke heuristic is the single gate that decides success vs verified_partial (it passes on ≥3 active checks, an http_check with a content match, or http_checks on ≥2 distinct paths).

success requires both a version-assertion and functional smoke. If the agent's plan is missing functional smoke, the runtime nudges it in real time and — for HTTP services — can auto-inject the missing checks; if the evidence is still incomplete, the honest fallback is verified_partial. (stability_wait auto-bumps to 120s for slow-booting JVM images; container_status is auto-prepended if omitted.)

Outcome statuses — every build writes exactly one:

Status	Meaning
`success`	built + verified (right version and functional smoke)
`verified_partial`	built and verify passed, but the version/smoke evidence was incomplete
`verify_failed`	built and launched, but verification did not pass
`launched_no_verify`	container launched but the run ended before any verify check
`unresolvable`	gave up — no buildable target (proprietary, kernel/firmware, or no image and no source)
`turn_cap`	hit the max-turns cap before converging
`budget_exhausted`	hit the cost cap before converging
`rate_limited`	gave up because of external rate limiting (Docker Hub / NVD / API)
`interrupted`	the run was interrupted before completing
`error`	engine or API failure
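
For the four statuses that can follow a successful launch, the table reduces to a short decision function. The sketch below is illustrative only: the `Status` enum mirrors the status strings above, but `verify_outcome` and its parameters are hypothetical and say nothing about how loop.py actually maps statuses.

```python
from enum import Enum


class Status(str, Enum):
    SUCCESS = "success"
    VERIFIED_PARTIAL = "verified_partial"
    VERIFY_FAILED = "verify_failed"
    LAUNCHED_NO_VERIFY = "launched_no_verify"


def verify_outcome(
    *,
    ran_checks: bool,
    all_passed: bool,
    has_version_assertion: bool,
    has_smoke: bool,
) -> Status:
    """Map verification evidence to one of the post-launch statuses."""
    if not ran_checks:
        return Status.LAUNCHED_NO_VERIFY
    if not all_passed:
        return Status.VERIFY_FAILED
    if has_version_assertion and has_smoke:
        return Status.SUCCESS
    # Verify passed, but the version/smoke evidence is incomplete.
    return Status.VERIFIED_PARTIAL
```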

Credentials & rate limits

cve-env defaults to anonymous tiers everywhere — no credentials required. Setting any of these raises the corresponding limit and reduces CVEs lost to transient throttling:

Service	Env var(s)	Anonymous	With credential
NVD API	`NVD_API_KEY`	5 req/30s	50 req/30s
GitHub API	`GITHUB_TOKEN`	60 req/hr	5,000 req/hr
Docker Hub	`DOCKER_USERNAME` + `DOCKER_PASSWORD`	100 pulls/6h	200 pulls/6h

Set them via .env (copy .env.example) or your shell profile. The GitHub token is resolved in order: GITHUB_TOKEN env → gh auth token (if the GitHub CLI is installed and logged in) → anonymous; an existing docker login session is reused likewise. The gh CLI is optional — only a convenient token source, not a dependency. Tokens are sent as request headers (never in URLs) and are redacted from audit logs.

Configuration

Settings resolve by precedence: CLI flag → env var (CVE_ENV_<UPPER_SNAKE>) → cve-env.toml → built-in default.

Per-CVE caps: set per run via the CLI flags --max-turns, --max-cost-usd, --turn-extension-pct, and --max-turn-extensions (built-in defaults: 24 turns, $0.60 soft cost cap, +20% × 2 extensions). The cost-extension knobs also have env vars (CVE_ENV_MAX_COST_EXTENSIONS, CVE_ENV_COST_EXTENSION_PCT), and the model is set via CVE_ENV_MODEL.
Per-stage soft budgets: the [budget] block in cve-env.toml (copy cve-env.toml.example).
Behavior / safety knobs: CVE_ENV_DISALLOWED_TOOLS (disable built-in agent tools), the source_build size caps (CVE_ENV_MAX_TARBALL_BYTES, CVE_ENV_MAX_EXTRACT_BYTES, …), and the lifecycle hooks below.

For the complete surface: cve-env build --help lists every CLI flag; .env.example documents the env vars; cve-env.toml.example shows the TOML keys; config.py holds the defaults.

Lifecycle hooks (opt-in, default off)

Env var	CLI flag	Action (post-build)
`CVE_ENV_AUTO_CLEANUP_CONTAINERS=1`	`--auto-cleanup-containers`	`docker rm -f` this run's labeled containers (concurrency-safe)
`CVE_ENV_AUTO_PRUNE_IMAGES=1`	`--auto-prune-images`	`docker image prune -f` (dangling layers only)
`CVE_ENV_AUTO_STOP_COLIMA=1`	`--auto-stop-colima`	`colima stop` if no other `cve-env build` is running

Defaults are off so iterative use keeps containers and Colima warm.

Design principles

Agentic, not corpus. No CVE → image dictionary; every run researches live and chooses its own path.
Session auth, not API key. claude-agent-sdk uses your Claude Code session; setting_sources=[] + skills=[] keep the agent's context free of your global rules.
Self-healing through retry, not orchestration. Verify failures, rate limits, and empty lookups are recovered by the agent + per-CVE runtime state (cooldowns, arch counters, refusal latches), reset between CVEs.
Correctness gated at runtime. success requires version-assertion and functional-smoke; anything less is verified_partial. A false-positive success is not reachable by construction.
Every refusal is logged for post-run analysis.

Known limitations (declared, not bugs)

Figures below are from 1,838 benched runs across 33 benches on the dev corpus.

Kernel / firmware / hardware / non-Linux CVEs aren't buildable. A Docker container can't be a kernel, a firmware image, or another OS — so CVEs in kernel drivers (e.g. Arm Mali GPU), firmware/BIOS (e.g. coreboot SMM), or non-Linux systems (e.g. FreeBSD) can't be reproduced as an application environment. The engine detects these and gives up cleanly (arch_incompatible).
Architecture is handled, but the tested host-platform surface is narrow. cve-env detects the host arch (arm64 / amd64), pulls the native image, uses Rosetta to run amd64 images on Apple-Silicon macOS, and falls back to source-build when no compatible image exists; benches ran on both arm64 and amd64. However, it has only ever been run on a macOS host with Colima: Linux/Windows hosts, Windows containers, and non-Colima Docker daemons are untested.
Proprietary-vendor CVEs whose CPE vendor overlaps closed-source products can't be fast-rejected by vendor match; cve-env spends a small research budget, then gives up cleanly (proprietary was the dominant give-up in benches).
Run-to-run variance on borderline CVEs: a few oscillate between success and unresolvable because of variance in agent reasoning and changes in external state.
Cost — it isn't free. Each build spends Claude tokens. A successful build cost a median ~$1.00 (p90 $1.69, max $2.49 across 602 successes); runs that give up early are cheaper (≈$0.13 median overall). Per-CVE cost is bounded by a configurable cap — monitor it on large runs.
Caps cut off the hardest CVEs. Each CVE has a turn cap and a cost cap; ~7% of runs hit one before building (turn_cap 6.9%, budget_exhausted 0.3%). A CVE that genuinely needs more than its cap won't finish. Both are configurable (--max-turns, --max-cost-usd, env, cve-env.toml).
Multi-step CMS auth+seed flows (e.g. admin-authed plugin SQLi) don't always converge within budget.

Dependencies

The four runtime deps (claude-agent-sdk, pydantic, pyyaml, requests) carry floors set to the versions cve-env was validated against and upper bounds at the next major (next minor for claude-agent-sdk during its 0.x series), so an API drift can't silently break a build. uv.lock pins the full transitive tree; uv sync installs exactly that. Bump a dependency intentionally, then re-validate.
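
The bounding rule (floor at the validated version, ceiling at the next major, or at the next minor for a 0.x package) can be expressed as a small helper that produces a version specifier. This is a sketch: `version_spec` is hypothetical, it applies the next-minor rule to any 0.x version, and the version numbers in the usage lines are made up rather than taken from pyproject.toml.

```python
def version_spec(floor: str) -> str:
    """Build a '>=floor,<ceiling' specifier from a validated floor version."""
    parts = [int(p) for p in floor.split(".")]
    major = parts[0]
    minor = parts[1] if len(parts) > 1 else 0
    if major == 0:
        ceiling = f"0.{minor + 1}"   # 0.x series: cap at the next minor
    else:
        ceiling = f"{major + 1}"     # stable series: cap at the next major
    return f">={floor},<{ceiling}"


if __name__ == "__main__":
    # Illustrative version numbers only.
    print(version_spec("2.31.0"))
    print(version_spec("0.1.5"))
```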

Project structure

The wheel ships only src/cve_env (pyproject.toml → packages = ["src/cve_env"]):

src/cve_env/
├── cli.py                  # `cve-env build <cve> | doctor`
├── config.py               # model, caps, paths, env overrides
├── models.py               # Outcome dataclass + status enum
├── policy.py / validators.py   # P14/P17/P18 build invariants
├── agent/
│   ├── llm.py              # claude-agent-sdk wrapper + retry
│   ├── loop.py             # turn loop, status mapping, audit write
│   ├── prompts.py          # system + user prompt rendering
│   ├── tools.py            # the 11 MCP tool registrations
│   ├── audit.py            # per-CVE JSONL writer (secret-redacting)
│   └── refusals.py         # refusal scanner + writer
├── tools/                  # nvd_lookup, github_fetch, image_resolve, source_build,
│                           # dockerfile_gen, docker_build, docker_run,
│                           # docker_compose_up, run_in_container, verify, web_fetch, arch
├── infra/service_health.py # `cve-env doctor`
└── utils/                  # run, lifecycle, safe_env, dockerfile_hygiene, …

Security posture

cve-env runs LLM-generated commands and fetches CVE research from the live internet — so it's built to contain that, and you should run it accordingly.

Built-in protections (automatic):

Builds and runs in a container — --cap-drop ALL, --security-opt no-new-privileges:true, ports bound to 127.0.0.1 only.
The in-process URL fetcher is SSRF-guarded — scheme allowlist + private/metadata-IP denylist + DNS-rebind re-resolution. Downloaded source archives are size-capped and path-traversal-/symlink-checked on extraction.
Audit logs redact some known secret-token shapes and are written owner-only (dir 0700, files 0600).
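
To make the SSRF guard concrete, here is a minimal sketch of a scheme allowlist combined with an address check. It is an assumption-laden illustration, not cve-env's fetcher: `check_url` is a hypothetical name, the allowlist is assumed to be http/https, and "private/metadata-IP denylist" is approximated by rejecting every address that is not globally routable. Returning the vetted addresses lets a caller connect to exactly those IPs instead of resolving the name a second time, which is one common way to blunt DNS rebinding.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # assumed allowlist


def check_url(url: str) -> list[str]:
    """Return the vetted IP addresses for url, or raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        raise ValueError(f"scheme or host not allowed: {url!r}")

    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    addrs = sorted({info[4][0] for info in infos})

    for addr in addrs:
        # Rejects loopback, RFC 1918, link-local (incl. 169.254.169.254), etc.
        if not ipaddress.ip_address(addr).is_global:
            raise ValueError(f"non-public address for {parts.hostname}: {addr}")
    return addrs
```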

Recommended when you run it (operator's responsibility):

Point docker at a non-root, isolated context (e.g. a Colima VM), so a build escape lands in the VM, not your host.
Optional: set CVE_ENV_DISALLOWED_TOOLS=WebFetch,WebSearch to disable the agent's built-in general web tools (extra SSRF-surface reduction, at some loss of research reach).

Disclaimer

Provided under the MIT License (see LICENSE) with no warranty. Not fully validated; outputs may be incomplete or inaccurate, and costs must be monitored closely. Do not deploy in production — use only for defensive research in lab environments.

License

MIT — see LICENSE.
