Agentic CVE → Docker environment builder.
Given a CVE ID, cve-env builds a Docker environment running the affected application at its pre-patch version and verifies the build is correct. The agent researches each CVE live (NVD + OSV + GitHub + container registries) and picks its own build path.
Version: 1.0 beta

Author: Gadi Evron (@gadievron)
status="success" requires both:
- Right version: a version-assertion exec_check (pip show, dpkg -l, openssl version, find / -name '*.jar', …) proving the deployed binaries fall in the CVE's affected range.
- Working app: functional-smoke checks proving normal operation on benign input (e.g. HTTP GET / + a content match, or a SELECT 1 roundtrip for a database).
If verify runs but that evidence is incomplete, the runtime records the honest fallback verified_partial instead. cve-env's job ends at a correct, working environment — it builds and verifies; it does not assess or attempt the vulnerability.
Every build runs in a hardened container: --cap-drop ALL, --security-opt no-new-privileges:true, ports bound to 127.0.0.1 only.
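As a rough illustration of how those flags look on a command line, the sketch below assembles a docker run argument list. The helper name hardened_run_argv and its parameters are hypothetical and are not cve-env's API; only the Docker flags themselves come from the sentence above.

```python
from __future__ import annotations


def hardened_run_argv(image: str, host_port: int, container_port: int) -> list[str]:
    """Build a `docker run` argv with the hardening flags described above.

    Hypothetical helper for illustration; not cve-env's actual code.
    """
    return [
        "docker", "run", "--detach",
        "--cap-drop", "ALL",                          # drop every Linux capability
        "--security-opt", "no-new-privileges:true",   # block privilege gain via setuid
        "--publish", f"127.0.0.1:{host_port}:{container_port}",  # loopback only
        image,
    ]


if __name__ == "__main__":
    print(" ".join(hardened_run_argv("wordpress:5.6", 8080, 80)))
```

Binding the published port to 127.0.0.1 keeps the service unreachable from other machines on the network, even when the host firewall is permissive.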
- Python 3.12+ and uv
- A Docker daemon reachable via docker on PATH. Developed and tested against Colima; any Docker-compatible daemon (Docker Desktop, OrbStack, rootless Docker) should work but is untested.
- The claude CLI, logged in. cve-env authenticates through your existing Claude Code session via claude-agent-sdk; no ANTHROPIC_API_KEY is consumed.
    git clone <this-repo> && cd cve-env
    uv sync

    # Build + verify one CVE (prints a per-stage report at the end)
    uv run cve-env build CVE-2014-0160

    # Probe external services (NVD, OSV, GitHub, Docker Hub, registries)
    uv run cve-env doctor

Per-CVE audit traces (every tool call, result, and cost) are written to output/agentic/<run-id>/<cve>.jsonl.
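Because the traces are JSONL (one JSON object per line), they are easy to post-process. The sketch below tallies records per tool and sums cost for one trace file. The record keys "tool" and "cost_usd" are assumptions made for illustration; inspect a real trace for the actual field names before relying on this.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_trace(path: Path) -> tuple[Counter, float]:
    """Count records per tool and sum cost in one JSONL trace.

    ASSUMPTION: the "tool" and "cost_usd" keys are hypothetical field names.
    """
    calls: Counter = Counter()
    total_cost = 0.0
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            calls[record.get("tool", "<unknown>")] += 1
            total_cost += float(record.get("cost_usd", 0.0))
    return calls, total_cost
```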
The agent picks each tool from the CVE record and prior results — there's no fixed script. Work moves through five stages (the stage of each tool is defined by config.TOOL_TO_STAGE):
| Stage | Goal | Tools |
|---|---|---|
| RESEARCH | Ground the CVE — product, affected version, source repo | nvd_lookup (OSV.dev fallback), github_fetch |
| RESOLVE | Find a pre-built image to pull | image_resolve (registry cascade) |
| ACQUIRE | Build an image when no pre-built one fits | dockerfile_gen, docker_build, source_build |
| LAUNCH | Start the environment | docker_run, docker_compose_up, run_in_container |
| VERIFY | Prove it's the right version and works | verify (7 check types); give_up ends the run |
RESOLVE and ACQUIRE are alternatives — pull an existing image, or build one. The decision cascade:
- Research → product + affected version + (for OSS) the source repo.
- Resolve: try to pull a pre-built image. image_resolve walks a registry cascade (Docker Hub → mirror.gcr.io → quay.io → ghcr.io → mcr.microsoft.com) with transport-class retry and an architecture check. If one fits → go to launch.
- Acquire (only when no pre-built image fits): build one, via one of:
  - source_build: clone the OSS repo at the pre-patch tag/commit and build it;
  - dockerfile_gen → docker_build: synthesize a Dockerfile (library / language CVEs);
  - plugin overlay: pull a clean host image (e.g. wordpress:5.6), then copy_ops the vulnerable plugin onto it;
  - forge cascade: fetch code hosted off GitHub (WP-SVN / OSDN / SourceForge).
- Launch → docker_run (single service) or docker_compose_up (multi-service vulhub stacks).
- Verify, then self-heal on failure: retry, pivot registry, pivot resolve→build, or give_up with a reason.
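The registry walk in the Resolve step can be pictured as a first-match loop. This is a minimal sketch, not image_resolve itself: the exists callback is a hypothetical probe (for example a manifest lookup) supplied by the caller, docker.io stands in for Docker Hub, and the retry and architecture checks mentioned above are left out.

```python
from __future__ import annotations

from typing import Callable, Optional

# Registry order taken from the cascade described above.
REGISTRIES = ["docker.io", "mirror.gcr.io", "quay.io", "ghcr.io", "mcr.microsoft.com"]


def resolve_image(repo: str, tag: str, exists: Callable[[str], bool]) -> Optional[str]:
    """Return the first image reference that `exists` accepts, else None.

    `exists` is a hypothetical probe; retries and arch checks are omitted.
    """
    for registry in REGISTRIES:
        ref = f"{registry}/{repo}:{tag}"
        if exists(ref):
            return ref
    return None  # no pre-built image: the caller would pivot to Acquire


if __name__ == "__main__":
    # Pretend only quay.io hosts the image.
    print(resolve_image("library/nginx", "1.4.0", lambda ref: ref.startswith("quay.io/")))
```

Returning None rather than raising keeps the "no image fits" case an ordinary outcome, which matches the cascade's fall-through to a build.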
These are cve-env's 11 custom MCP tools; the agent also uses the Claude Code SDK's built-in tools (Bash, Write, Read, …) for staging files, git clone, and direct shell steps — Bash is in fact its most-used tool. Recovery is the agent's, not an orchestrator's: empty nvd_lookup → OSV.dev; rate-limited image_resolve → next registry (or a generic base image + manual install); source_build with no matching tag → build directly from a git clone in the Dockerfile.
VERIFY builds evidence in layers, from "is it up" to "is it the right version doing real work." All HTTP/TCP probes hit the container's published port, bound to 127.0.0.1 only — a non-loopback target is rejected, so the agent can't probe the host network.
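A loopback-only rule like the one above can be expressed with the standard library's ipaddress module. The function name is hypothetical, and rejecting hostnames outright (including localhost) is a simplification chosen for this sketch rather than something the text specifies.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_target(url: str) -> bool:
    """True only when the URL's host is a literal loopback IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # hostnames (even "localhost") are rejected in this sketch


if __name__ == "__main__":
    for candidate in ("http://127.0.0.1:8080/", "http://10.0.0.5/", "http://localhost/"):
        print(candidate, is_loopback_target(candidate))
```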
| Layer | Check type(s) | Proves |
|---|---|---|
| Readiness | container_status, stability_wait, log_check | the container started and stayed up; an expected startup / health marker appears in docker logs |
| Networking | http_check | the service answers on its port, and the body isn't empty (a zero-byte 200 fails) |
| Actual usage | http_request_check (send a request body, match the response), tcp_probe_check (raw TCP send/receive: Redis / Postgres / SMTP / SSH / Memcached …), exec_check (run a command inside the container, assert stdout + exit code) | the application does real work on benign input |
| Version proof | exec_check (e.g. openssl version, pip show, dpkg -l, find / -name '*.jar') | the deployed binaries fall in the CVE's affected range |
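To make the version-proof row concrete, here is a small sketch that extracts the Version field from pip show output and tests it against a half-open affected range. All function names are hypothetical, the package and range in the usage lines are invented, and the parser only handles purely numeric dotted versions (it would not cope with schemes such as OpenSSL's letter suffixes).

```python
import re


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.25.1' into (2, 25, 1); numeric dotted versions only."""
    return tuple(int(part) for part in text.strip().split("."))


def pip_show_version(stdout: str) -> str:
    """Extract the Version field from `pip show <package>` output."""
    match = re.search(r"^Version:\s*(\S+)\s*$", stdout, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Version: line found")
    return match.group(1)


def in_affected_range(version: str, first_affected: str, first_fixed: str) -> bool:
    """True when first_affected <= version < first_fixed."""
    return parse_version(first_affected) <= parse_version(version) < parse_version(first_fixed)


if __name__ == "__main__":
    # Invented example output and range, for illustration only.
    sample = "Name: example-lib\nVersion: 2.5.0\nSummary: demo\n"
    print(in_affected_range(pip_show_version(sample), "2.0.0", "2.6.0"))
```

Comparing parsed tuples rather than raw strings avoids the classic mistake of ranking "2.10.0" below "2.9.0".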
Functional smoke is matched to the application type — not just an HTTP ping. A web app gets a page fetch with a content match (and a deliberate 404); a database gets a query roundtrip (SELECT 1, INSERT/SELECT); a cache / wire-protocol service gets a protocol probe (e.g. a Redis PING → +PONG over raw TCP, or redis-cli ping via exec_check); a library gets a trivial-use exec_check. The has_functional_smoke heuristic is the single gate that decides success vs verified_partial (it passes on ≥3 active checks, an http_check with a content match, or http_checks on ≥2 distinct paths).
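The Redis example above (PING answered by +PONG over raw TCP) can be reproduced with a few lines of socket code. This is only a sketch of the idea behind a TCP probe, with a hypothetical function name and signature; it reads a single chunk of the reply, which is enough for short protocol greetings but not for large responses.

```python
import socket


def tcp_probe(host: str, port: int, payload: bytes, expect: bytes, timeout: float = 3.0) -> bool:
    """Send `payload` over a fresh TCP connection and check the reply prefix."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
        reply = sock.recv(1024)  # one read; adequate for short replies
    return reply.startswith(expect)


# Usage against a Redis container published on loopback port 6379:
#   tcp_probe("127.0.0.1", 6379, b"PING\r\n", b"+PONG")
```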
success requires both a version-assertion and functional smoke. If the agent's plan is missing functional smoke, the runtime nudges it in real time and — for HTTP services — can auto-inject the missing checks; if the evidence is still incomplete, the honest fallback is verified_partial. (stability_wait auto-bumps to 120s for slow-booting JVM images; container_status is auto-prepended if omitted.)
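The gate described in the last two paragraphs can be sketched as follows. This is not cve-env's has_functional_smoke: the Check dataclass, its fields, and in particular the definition of an "active" check (here, anything that is not a readiness check) are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional

# ASSUMPTION: readiness checks count as passive, everything else as active.
PASSIVE_TYPES = {"container_status", "stability_wait", "log_check"}


@dataclass(frozen=True)
class Check:
    type: str                            # e.g. "http_check", "exec_check"
    path: Optional[str] = None           # URL path, for http_check
    content_match: Optional[str] = None  # expected body substring, if any
    version_assertion: bool = False      # hypothetical marker for version proof


def has_functional_smoke(checks: list[Check]) -> bool:
    """>=3 active checks, or an http_check with a content match, or >=2 http paths."""
    active = [c for c in checks if c.type not in PASSIVE_TYPES]
    http = [c for c in checks if c.type == "http_check"]
    return (
        len(active) >= 3
        or any(c.content_match for c in http)
        or len({c.path for c in http if c.path is not None}) >= 2
    )


def final_status(checks: list[Check]) -> str:
    """Status for a run whose verify checks all passed."""
    has_version = any(c.version_assertion for c in checks)
    return "success" if has_version and has_functional_smoke(checks) else "verified_partial"
```

For example, a lone version-asserting exec_check yields verified_partial, while adding an http_check on / with a content match yields success.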
Outcome statuses — every build writes exactly one:
| Status | Meaning |
|---|---|
| success | built + verified (right version and functional smoke) |
| verified_partial | built and verify passed, but the version/smoke evidence was incomplete |
| verify_failed | built and launched, but verification did not pass |
| launched_no_verify | container launched but the run ended before any verify check |
| unresolvable | gave up: no buildable target (proprietary, kernel/firmware, or no image and no source) |
| turn_cap | hit the max-turns cap before converging |
| budget_exhausted | hit the cost cap before converging |
| rate_limited | gave up due to external rate-limiting (Docker Hub / NVD / API) |
| interrupted | the run was interrupted before completing |
| error | engine or API failure |
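When consuming outcomes programmatically, the ten statuses map naturally onto an enum. The sketch below mirrors the table's string values; the class and member names, and the reached_verify helper, are hypothetical and may not match the enum in models.py.

```python
from enum import Enum


class Status(str, Enum):
    """String values copied from the table above; member names are illustrative."""
    SUCCESS = "success"
    VERIFIED_PARTIAL = "verified_partial"
    VERIFY_FAILED = "verify_failed"
    LAUNCHED_NO_VERIFY = "launched_no_verify"
    UNRESOLVABLE = "unresolvable"
    TURN_CAP = "turn_cap"
    BUDGET_EXHAUSTED = "budget_exhausted"
    RATE_LIMITED = "rate_limited"
    INTERRUPTED = "interrupted"
    ERROR = "error"


def reached_verify(status: Status) -> bool:
    """True for the statuses whose meaning implies verify checks actually ran."""
    return status in {Status.SUCCESS, Status.VERIFIED_PARTIAL, Status.VERIFY_FAILED}


if __name__ == "__main__":
    print(reached_verify(Status("verify_failed")))
```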
cve-env defaults to anonymous tiers everywhere — no credentials required. Setting any of these raises the corresponding limit and reduces CVEs lost to transient throttling:
| Service | Env var(s) | Anonymous | With credential |
|---|---|---|---|
| NVD API | NVD_API_KEY | 5 req/30s | 50 req/30s |
| GitHub API | GITHUB_TOKEN | 60 req/hr | 5,000 req/hr |
| Docker Hub | DOCKER_USERNAME + DOCKER_PASSWORD | 100 pulls/6h | 200 pulls/6h |
Set them via .env (copy .env.example) or your shell profile. The GitHub token is resolved in order: GITHUB_TOKEN env → gh auth token (if the GitHub CLI is installed and logged in) → anonymous; an existing docker login session is reused likewise. The gh CLI is optional — only a convenient token source, not a dependency. Tokens are sent as request headers (never in URLs) and are redacted from audit logs.
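The GitHub token lookup order described above could be sketched like this. The function names are hypothetical; gh auth token is the GitHub CLI subcommand that prints the stored token, and the Authorization: Bearer header is one of the forms the GitHub REST API accepts.

```python
from __future__ import annotations

import os
import shutil
import subprocess
from typing import Mapping, Optional


def resolve_github_token(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """GITHUB_TOKEN env var, else `gh auth token`, else None (anonymous)."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        return token
    if shutil.which("gh") is None:
        return None  # gh CLI not installed: stay anonymous
    try:
        result = subprocess.run(
            ["gh", "auth", "token"],
            capture_output=True, text=True, timeout=10, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    token = result.stdout.strip()
    return token if result.returncode == 0 and token else None


def auth_headers(token: Optional[str]) -> dict[str, str]:
    """Carry the token in a request header, never in the URL."""
    return {"Authorization": f"Bearer {token}"} if token else {}
```

Passing env as a parameter keeps the function testable without touching the real process environment.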
Settings resolve by precedence: CLI flag → env var (CVE_ENV_<UPPER_SNAKE>) → cve-env.toml → built-in default.
- Per-CVE caps: set per run via CLI flags --max-turns, --max-cost-usd, --turn-extension-pct, --max-turn-extensions (built-in defaults: 24 turns, $0.60 soft cost, +20% × 2 extensions). The cost-extension knobs also have env vars (CVE_ENV_MAX_COST_EXTENSIONS, CVE_ENV_COST_EXTENSION_PCT); model via CVE_ENV_MODEL.
- Per-stage soft budgets: cve-env.toml [budget] block (copy cve-env.toml.example).
- Behavior / safety knobs: CVE_ENV_DISALLOWED_TOOLS (disable built-in agent tools), the source_build size caps (CVE_ENV_MAX_TARBALL_BYTES, CVE_ENV_MAX_EXTRACT_BYTES, …), and the lifecycle hooks below.
For the complete surface: cve-env build --help lists every CLI flag; .env.example documents the env vars; cve-env.toml.example shows the TOML keys; config.py holds the defaults.
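The precedence rule (CLI flag → env var → cve-env.toml → built-in default) amounts to a first-hit lookup. The sketch below shows the idea with hypothetical function names and plain dictionaries standing in for the parsed TOML file and the defaults in config.py; type coercion of env strings is left out.

```python
from __future__ import annotations

from typing import Any, Mapping, Optional


def env_name(setting: str) -> str:
    """model -> CVE_ENV_MODEL (the CVE_ENV_<UPPER_SNAKE> convention)."""
    return "CVE_ENV_" + setting.upper()


def resolve_setting(
    setting: str,
    cli_value: Optional[Any],
    env: Mapping[str, str],
    toml_values: Mapping[str, Any],
    defaults: Mapping[str, Any],
) -> Any:
    """Return the first value found: CLI flag, env var, TOML file, default."""
    if cli_value is not None:
        return cli_value
    name = env_name(setting)
    if name in env:
        return env[name]  # env values are strings; real code would coerce types
    if setting in toml_values:
        return toml_values[setting]
    return defaults[setting]


if __name__ == "__main__":
    # Placeholder values, not real model names.
    print(resolve_setting("model", None, {"CVE_ENV_MODEL": "env-model"}, {}, {"model": "default-model"}))
```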
| Env var | CLI flag | Action (post-build) |
|---|---|---|
| CVE_ENV_AUTO_CLEANUP_CONTAINERS=1 | --auto-cleanup-containers | docker rm -f this run's labeled containers (concurrency-safe) |
| CVE_ENV_AUTO_PRUNE_IMAGES=1 | --auto-prune-images | docker image prune -f (dangling layers only) |
| CVE_ENV_AUTO_STOP_COLIMA=1 | --auto-stop-colima | colima stop if no other cve-env build is running |
Defaults are off so iterative use keeps containers and Colima warm.
- Agentic, not corpus. No CVE → image dictionary; every run researches live and chooses its own path.
- Session auth, not API key. claude-agent-sdk uses your Claude Code session; setting_sources=[] + skills=[] keep the agent's context free of your global rules.
- Self-healing through retry, not orchestration. Verify failures, rate limits, and empty lookups are recovered by the agent + per-CVE runtime state (cooldowns, arch counters, refusal latches), reset between CVEs.
- Correctness gated at runtime. success requires version-assertion and functional-smoke; anything less is verified_partial. A false-positive success is not reachable by construction.
- Every refusal is logged for post-run analysis.
Figures below are from 1,838 benched runs across 33 benches on the dev corpus.
- Kernel / firmware / hardware / non-Linux CVEs aren't buildable. A Docker container can't be a kernel, a firmware image, or another OS, so CVEs in kernel drivers (e.g. Arm Mali GPU), firmware/BIOS (e.g. coreboot SMM), or non-Linux systems (e.g. FreeBSD) can't be reproduced as an application environment. The engine detects these and gives up cleanly (arch_incompatible).
- Architecture is handled; the tested host-platform surface is narrow. cve-env detects the host arch (arm64 / amd64), pulls the native image, uses Rosetta to run amd64 images on Apple-Silicon macOS, and falls back to source-build when no compatible image exists; benches ran on both arm64 and amd64. But it has only ever been run on a macOS host + Colima; Linux/Windows hosts, Windows containers, and non-Colima Docker daemons are untested.
- Proprietary-vendor CVEs whose CPE vendor overlaps closed-source products can't be fast-rejected by vendor match; cve-env spends a small research budget, then gives up cleanly (proprietary was the dominant give-up in benches).
- Run-to-run variance on borderline CVEs. A few oscillate success ↔ unresolvable from agent-reasoning variance + external state.
- Cost: it isn't free. Each build spends Claude tokens. A successful build cost a median ~$1.00 (p90 $1.69, max $2.49 across 602 successes); runs that give up early are cheaper (≈$0.13 median overall). Per-CVE cost is bounded by a configurable cap; monitor it on large runs.
- Caps cut off the hardest CVEs. Each CVE has a turn cap and a cost cap; ~7% of runs hit one before building (turn_cap 6.9%, budget_exhausted 0.3%). A CVE that genuinely needs more than its cap won't finish. Both are configurable (--max-turns, --max-cost-usd, env, cve-env.toml).
- Multi-step CMS auth+seed flows (e.g. admin-authed plugin SQLi) don't always converge within budget.
The four runtime deps (claude-agent-sdk, pydantic, pyyaml, requests) carry floors set to the versions cve-env was validated against and upper bounds at the next major (next minor for claude-agent-sdk during its 0.x series), so an API drift can't silently break a build. uv.lock pins the full transitive tree; uv sync installs exactly that. Bump a dependency intentionally, then re-validate.
The wheel ships only src/cve_env (pyproject.toml → packages = ["src/cve_env"]):
src/cve_env/
├── cli.py # `cve-env build <cve> | doctor`
├── config.py # model, caps, paths, env overrides
├── models.py # Outcome dataclass + status enum
├── policy.py / validators.py # P14/P17/P18 build invariants
├── agent/
│ ├── llm.py # claude-agent-sdk wrapper + retry
│ ├── loop.py # turn loop, status mapping, audit write
│ ├── prompts.py # system + user prompt rendering
│ ├── tools.py # the 11 MCP tool registrations
│ ├── audit.py # per-CVE JSONL writer (secret-redacting)
│ └── refusals.py # refusal scanner + writer
├── tools/ # nvd_lookup, github_fetch, image_resolve, source_build,
│ # dockerfile_gen, docker_build, docker_run,
│ # docker_compose_up, run_in_container, verify, web_fetch, arch
├── infra/service_health.py # `cve-env doctor`
└── utils/ # run, lifecycle, safe_env, dockerfile_hygiene, …
cve-env runs LLM-generated commands and fetches CVE research from the live internet — so it's built to contain that, and you should run it accordingly.
Built-in protections (automatic):
- Builds and runs in a container: --cap-drop ALL, --security-opt no-new-privileges:true, ports bound to 127.0.0.1 only.
- The in-process URL fetcher is SSRF-guarded: scheme allowlist + private/metadata-IP denylist + DNS-rebind re-resolution. Downloaded source archives are size-capped and path-traversal-/symlink-checked on extraction.
- Audit logs redact some known secret-token shapes and are written owner-only (dir 0700, files 0600).
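For readers unfamiliar with SSRF guards, the following sketch shows the general shape of a scheme allowlist plus an IP denylist. It is an illustration under assumptions, not cve-env's fetcher: the allowed schemes and function names are hypothetical, and a complete guard would also connect to the exact address it vetted (the DNS-rebind protection mentioned above), which this sketch does not do.

```python
import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # ASSUMPTION: the real allowlist may differ


def is_public_ip(text: str) -> bool:
    """False for private, loopback, link-local (cloud metadata), and similar ranges."""
    try:
        ip = ipaddress.ip_address(text)
    except ValueError:
        return False
    return not (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_multicast or ip.is_reserved or ip.is_unspecified
    )


def is_fetch_allowed(url: str) -> bool:
    """Allow only http(s) URLs whose host resolves exclusively to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        return False
    try:
        infos = socket.getaddrinfo(parts.hostname, parts.port or 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return False
    # Every resolved address must be public; one private record is enough to refuse.
    return bool(infos) and all(is_public_ip(info[4][0]) for info in infos)
```

Requiring all resolved addresses to be public, rather than any, closes the case where a hostname returns one public and one internal record.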
Recommended when you run it (operator's responsibility):
- Point docker at a non-root, isolated context (e.g. a Colima VM), so a build escape lands in the VM, not your host.
- Optional: set CVE_ENV_DISALLOWED_TOOLS=WebFetch,WebSearch to disable the agent's built-in general web tools (extra SSRF-surface reduction, at some loss of research reach).
Provided under the MIT License (see LICENSE) with no warranty. Not fully validated; outputs may be incomplete or inaccurate, and costs must be monitored closely. Do not deploy in production — use only for defensive research in lab environments.
MIT — see LICENSE.