chaos-agent

LangGraph chaos-engineering agent. Plans, runs, and reports on AWS Fault Injection Service experiments — measuring RTO, cascading-failure paths, and (when an Aurora target is wired up) RPO. A small React frontend lets users trigger experiments and review reports.

It's the active counterpart to a reactive SRE/incident-investigation agent: one causes and measures failures, the other investigates incidents. Pair them to measure MTTD + MTTR alongside RTO.

What it does today

Lists FIS experiment templates in your AWS account (read-only)
Captures a CloudWatch baseline before each experiment
Triggers an FIS experiment, polls state, listens for related CloudWatch alarms
Detects cascading failures by correlating cross-service metric breaches
Computes RTO from alarm OK-transition timestamps after the experiment ends
Generates an HTML report (FIS itself produces the official PDF report into S3 — we link to it)
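
The RTO step in the list above can be illustrated with a small sketch. It is not the agent's actual code: the transition record shape, the alarm names, and the choice to measure from the start of fault injection are all assumptions made for illustration. Only OK transitions at or after the experiment end count as recovery, as the list describes.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Iterable, Mapping, Optional


def compute_rto(
    fault_start: datetime,
    experiment_end: datetime,
    transitions: Iterable[Mapping[str, Any]],
) -> Optional[timedelta]:
    """Time from fault_start until every breached alarm is back to OK.

    Each transition has "alarm", "state" and "timestamp" keys (a hypothetical
    shape). Returns None if a breached alarm never recovers, and a zero
    duration if no alarm breached at all.
    """
    breached = set()
    recovered_at = {}
    for t in sorted(transitions, key=lambda t: t["timestamp"]):
        name, state, ts = t["alarm"], t["state"], t["timestamp"]
        if state == "ALARM" and ts >= fault_start:
            breached.add(name)
            recovered_at.pop(name, None)  # a re-breach resets recovery
        elif state == "OK" and name in breached and ts >= experiment_end:
            recovered_at.setdefault(name, ts)  # keep the first OK after the end
    if not breached:
        return timedelta(0)
    if breached - recovered_at.keys():
        return None  # at least one alarm is still in ALARM
    return max(recovered_at.values()) - fault_start


# Hypothetical alarm history for a 5-minute experiment.
t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
events = [
    {"alarm": "sre-chaos-web-5xx", "state": "ALARM", "timestamp": t0 + timedelta(minutes=1)},
    {"alarm": "sre-chaos-queue-depth", "state": "ALARM", "timestamp": t0 + timedelta(minutes=2)},
    {"alarm": "sre-chaos-web-5xx", "state": "OK", "timestamp": t0 + timedelta(minutes=6)},
    {"alarm": "sre-chaos-queue-depth", "state": "OK", "timestamp": t0 + timedelta(minutes=9)},
]
rto = compute_rto(t0, t0 + timedelta(minutes=5), events)
```

Here the slowest alarm returns to OK nine minutes after injection began, so rto is nine minutes. Taking the latest recovery across all breached alarms treats the system as recovered only once its last dependent service is healthy.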

Architecture

LangGraph state machine over Bedrock (Claude Opus 4.7 for planning/analysis, Haiku 4.5 for poll loops). FastAPI service exposes REST + SSE under /api; the React+Vite frontend consumes them. State is checkpointed to SQLite locally and to DynamoDB when deployed (STORE_BACKEND); reports render to local disk and optionally to S3 (REPORT_BUCKET). The same container runs locally and on ECS Fargate behind an ALB — see Deploy to AWS.

flowchart LR
    User([SRE]) --> UI[React UI]
    UI <--> API[FastAPI<br/>REST + SSE]
    API --> LG[LangGraph<br/>plan → preflight → baseline → APPROVE<br/>→ inject → observe → recover → analyze → report]
    LG -. invokeModel .-> Bedrock[Bedrock<br/>Opus 4.7 + Haiku 4.5]
    LG --> FIS[AWS FIS]
    LG --> CW[CloudWatch]
    LG -. RPO .-> RDS[Aurora]
    FIS -. PDF .-> S3[(S3)]
    FIS --> Target[Target under test]
    Target --> CW
    API <--> Store[(State store<br/>SQLite → DynamoDB)]

See docs/architecture.md and docs/architecture.drawio for the full design (Mermaid and drawio diagrams included). The customer-facing pitch is in docs/pitch.md.
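
To make the stage order in the flowchart concrete, here is a plain-Python sketch of a runner that walks the stages and pauses at the approval gate. It deliberately does not use LangGraph, and the handler signature, state keys, and status strings are assumptions for illustration rather than the project's real graph definition.

```python
from typing import Callable, Dict, List

# Stage order as drawn in the flowchart; "approve" is the human gate.
STAGES: List[str] = [
    "plan", "preflight", "baseline", "approve",
    "inject", "observe", "recover", "analyze", "report",
]

Handler = Callable[[dict], dict]  # hypothetical: takes state, returns updates


def run_pipeline(handlers: Dict[str, Handler], approved: bool) -> dict:
    """Run stages in order and stop before 'inject' unless approved."""
    state: dict = {"completed": []}
    for stage in STAGES:
        if stage == "approve":
            if not approved:
                state["status"] = "awaiting_approval"
                return state
        else:
            state.update(handlers[stage](state))
        state["completed"].append(stage)
    state["status"] = "done"
    return state


# Stub handlers that only record that they ran.
handlers = {s: (lambda state, s=s: {s: "ok"}) for s in STAGES if s != "approve"}
paused = run_pipeline(handlers, approved=False)
finished = run_pipeline(handlers, approved=True)
```

Without approval, the run stops after plan, preflight, and baseline, so nothing mutating has happened yet. In a checkpointed graph the paused state would be persisted and resumed once the user confirms.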

Quickstart (local)

# 1. Create a venv and install (Python 3.11+; uv works well)
uv venv --python python3.11 .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# 2. Configure
cp .env.example .env
# edit .env to point at your AWS profile / region

# 3. Confirm AWS access (read-only)
python -m agent.tools.fis_tools list-templates

# 4. Run the API
uvicorn api.main:app --reload --port 8787

# 5. Run the frontend (in a second terminal, since uvicorn keeps the first one busy)
cd frontend && npm install && npm run dev
# open http://localhost:5173
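
If step 3 fails, it can help to call the read-only FIS API directly and rule out credential or region problems. The sketch below is an assumption about how such a check might look, not the contents of agent.tools.fis_tools; the function name is hypothetical. It takes the client as a parameter so it can be exercised without AWS access.

```python
from typing import Any, Dict, Iterator


def iter_experiment_templates(fis_client: Any) -> Iterator[Dict[str, Any]]:
    """Yield every FIS experiment template summary, following nextToken."""
    kwargs: Dict[str, Any] = {"maxResults": 50}
    while True:
        page = fis_client.list_experiment_templates(**kwargs)
        yield from page.get("experimentTemplates", [])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


# Usage (needs boto3 and credentials allowed to call fis:ListExperimentTemplates):
#   import boto3
#   for tpl in iter_experiment_templates(boto3.client("fis")):
#       print(tpl["id"], tpl.get("description", ""))
```

An empty result with no error usually means the credentials work but the selected region has no templates.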

Deploy to AWS

The agent ships as a single container (multi-stage infra/Dockerfile builds the React SPA and serves it from the FastAPI app). It runs on ECS Fargate behind an ALB, with DynamoDB for run state and a private S3 bucket for reports. Two equivalent IaC options are provided — pick one:

Option	Path	Quickstart
Terraform	`infra/terraform/`	`terraform init && terraform apply`, then build/push the image
AWS CDK	`infra/cdk/`	`npm install && npx cdk deploy` (builds the image as a CDK asset)

Both create a least-privilege task role from infra/iam-policy.json: FIS start/stop + read, CloudWatch read, RDS read (for RPO), Bedrock invoke, and scoped DynamoDB + S3 — no iam:*, no *Delete*, no infra teardown permissions. Each IaC folder's README has the full sequence and hardening notes (HTTPS/Cognito, private subnets).
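
The claims above (no iam:*, no *Delete*) can be spot-checked with a short script. This is a heuristic sketch, not a policy analyzer: it only inspects Allow statements with an Action key, it does not expand partial wildcards such as dynamodb:*Item, and the example policy is hypothetical rather than a copy of infra/iam-policy.json.

```python
from typing import Any, List


def _as_list(value: Any) -> list:
    """IAM allows a single item or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def risky_actions(policy: dict) -> List[str]:
    """Return allowed actions that are IAM, Delete-like, or service-wide."""
    found: List[str] = []
    for stmt in _as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        for action in _as_list(stmt.get("Action", [])):
            service, _, name = action.partition(":")
            if (
                action == "*"                 # everything
                or name == "*"                # whole service, includes Delete*
                or service.lower() == "iam"   # any IAM permission
                or "delete" in name.lower()   # explicit delete actions
            ):
                found.append(action)
    return found


# Hypothetical policy in the spirit of the description above.
example = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["fis:StartExperiment", "fis:StopExperiment", "fis:GetExperiment"],
            "Resource": "*",
        },
        {"Effect": "Allow", "Action": "cloudwatch:DescribeAlarms", "Resource": "*"},
    ],
}
clean = risky_actions(example)

# To check the real file:
#   import json
#   print(risky_actions(json.load(open("infra/iam-policy.json"))))
```

An empty list is a necessary signal rather than proof of least privilege, since Resource scoping and conditions are not examined here.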

This deploys the agent's control plane only. Point it at an AWS account that already has FIS experiment templates + CloudWatch alarms (see Target infra).

Target infra

The agent is target-agnostic — point it at any AWS account that already has FIS experiment templates and CloudWatch alarms defined.

For development we use a small Terraform target (EC2 web servers, a Lambda order processor, DynamoDB, SQS, four FIS templates, and CloudWatch alarms named <prefix>-*). The default alarm prefix is sre-chaos- — change DEFAULT_TARGET_ALARMS in agent/tools/cloudwatch_tools.py to match your environment.
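
Before changing the prefix, it may be useful to confirm which alarms in the target account actually match it. The sketch below uses the CloudWatch DescribeAlarms paginator with AlarmNamePrefix; the function name is hypothetical, and it makes no assumption about the shape of DEFAULT_TARGET_ALARMS.

```python
from typing import Any, List


def alarm_names_with_prefix(cloudwatch_client: Any, prefix: str = "sre-chaos-") -> List[str]:
    """Return sorted names of metric alarms whose name starts with prefix."""
    paginator = cloudwatch_client.get_paginator("describe_alarms")
    names: List[str] = []
    for page in paginator.paginate(AlarmNamePrefix=prefix):
        names.extend(alarm["AlarmName"] for alarm in page.get("MetricAlarms", []))
    return sorted(names)


# Usage (needs boto3 and credentials allowed to call cloudwatch:DescribeAlarms):
#   import boto3
#   print(alarm_names_with_prefix(boto3.client("cloudwatch"), "my-team-"))
```

If the list comes back empty, the agent would have no alarms to watch for that prefix, so RTO could not be derived from alarm transitions.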

EKS + Aurora support (Chaos Mesh, AuroraReplicaLag-based RPO) is coded but feature-flagged off; flip ENABLE_CHAOS_MESH / ENABLE_AURORA_RPO in .env once you have an EKS cluster and Aurora cluster to point at.

Safety

Mutating AWS calls require user confirmation in the UI: a human-in-the-loop (HITL) gate sits before fis:StartExperiment.
The agent never calls fis:DeleteExperimentTemplate or any destructive non-FIS API. It only starts/stops experiments and reads telemetry.
All write operations are logged to .artifacts/audit.log.
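
The combination of a confirmation gate and an audit log can be sketched as a single wrapper around any mutating call. This is an illustrative assumption, not the agent's implementation: the JSON-lines format, the field names, and the decision to also log refused attempts are choices made for this example. Only the log path comes from the text above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable


class ApprovalRequired(Exception):
    """Raised when a mutating call is attempted without user confirmation."""


def guarded_call(
    action: str,
    fn: Callable[[], Any],
    approved: bool,
    audit_path: Path = Path(".artifacts/audit.log"),
) -> Any:
    """Append one JSON line per attempt, then run fn only if approved."""
    audit_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "approved": approved,
    }
    with audit_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    if not approved:
        raise ApprovalRequired(action)
    return fn()


# Usage with a hypothetical boto3 FIS client:
#   guarded_call(
#       "fis:StartExperiment",
#       lambda: fis.start_experiment(experimentTemplateId=template_id),
#       approved=user_clicked_confirm,
#   )
```

Writing the log entry before the call means an attempt is recorded even if the AWS request itself fails.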

Credits / vendored skills

skills/chaos-engineering-on-aws/ — from aws-samples/sample-aws-resilience-skill (MIT-0)
skills/eks-resilience-checker/ — same source
skills/aws-resilience-modeling/ — same source
skills/mastering-langgraph/ — from spillwavesolutions/mastering-langgraph-agent-skill (MIT)

These are loaded as system-prompt context per LangGraph node so the agent's planning, pre-flight checks, and analysis follow established patterns instead of being re-invented.
