LangGraph chaos-engineering agent. Plans, runs, and reports on AWS Fault Injection Service experiments — measuring RTO, cascading-failure paths, and (when an Aurora target is wired up) RPO. A small React frontend lets users trigger experiments and review reports.
It's the active counterpart to a reactive SRE/incident-investigation agent: one causes and measures failures, the other investigates incidents. Pair them to measure MTTD + MTTR alongside RTO.
- Lists FIS experiment templates in your AWS account (read-only)
- Captures a CloudWatch baseline before each experiment
- Triggers an FIS experiment, polls its state, and listens for related CloudWatch alarms
- Detects cascading failures by correlating cross-service metric breaches
- Computes RTO from alarm OK-transition timestamps after the experiment ends
- Generates an HTML report (FIS itself produces the official PDF report into S3 — we link to it)
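The RTO step in the list above can be sketched in plain Python. This is a minimal illustration, not the agent's actual code: it assumes RTO is measured from the experiment's end time to the moment the last alarm that fired returns to OK, and it assumes alarm history arrives as simple `(alarm_name, new_state, timestamp)` tuples. The function name, the tuple shape, and the alarm names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# Hypothetical, simplified shape for one alarm state change:
# (alarm_name, new_state, timestamp)
Transition = Tuple[str, str, datetime]


def compute_rto(
    experiment_end: datetime, transitions: Iterable[Transition]
) -> Optional[timedelta]:
    """Time from experiment end until every alarm that fired is OK again.

    Assumption: RTO is anchored at the experiment end time. Returns None
    when at least one alarm that fired has not yet returned to OK.
    """
    last_seen = {}  # alarm name -> (latest state, time of that transition)
    fired = set()   # alarms that entered ALARM at least once
    for name, state, ts in sorted(transitions, key=lambda t: t[2]):
        if state == "ALARM":
            fired.add(name)
        last_seen[name] = (state, ts)

    latest_recovery = experiment_end
    for name in fired:
        state, ts = last_seen[name]
        if state != "OK":
            return None  # still breached, or stuck in INSUFFICIENT_DATA
        # An alarm that recovered before the experiment ended adds no time.
        latest_recovery = max(latest_recovery, ts)
    return latest_recovery - experiment_end


# Example input with hypothetical alarm names.
end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
history = [
    ("web-5xx", "ALARM", end - timedelta(minutes=5)),
    ("web-5xx", "OK", end + timedelta(seconds=90)),
    ("queue-depth", "ALARM", end - timedelta(minutes=3)),
    ("queue-depth", "OK", end + timedelta(minutes=4)),
]
rto = compute_rto(end, history)
```

With this input the slowest alarm recovers four minutes after the experiment ends, so `rto` is `timedelta(minutes=4)`. If your definition of RTO starts at fault injection instead, swap the reference timestamp accordingly.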
LangGraph state machine over Bedrock (Claude Opus 4.7 for planning/analysis, Haiku 4.5 for poll loops). FastAPI service exposes REST + SSE under /api; the React+Vite frontend consumes them. State is checkpointed to SQLite locally and to DynamoDB when deployed (STORE_BACKEND); reports render to local disk and optionally to S3 (REPORT_BUCKET). The same container runs locally and on ECS Fargate behind an ALB — see Deploy to AWS.
flowchart LR
User([SRE]) --> UI[React UI]
UI <--> API[FastAPI<br/>REST + SSE]
API --> LG[LangGraph<br/>plan → preflight → baseline → APPROVE<br/>→ inject → observe → recover → analyze → report]
LG -. invokeModel .-> Bedrock[Bedrock<br/>Opus 4.7 + Haiku 4.5]
LG --> FIS[AWS FIS]
LG --> CW[CloudWatch]
LG -. RPO .-> RDS[Aurora]
FIS -. PDF .-> S3[(S3)]
FIS --> Target[Target under test]
Target --> CW
API <--> Store[(State store<br/>SQLite → DynamoDB)]
See docs/architecture.md and docs/architecture.drawio for full design (Mermaid + drawio diagrams included). Customer-facing pitch in docs/pitch.md.
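The node order in the diagram (plan → preflight → baseline → APPROVE → inject → observe → recover → analyze → report) can be illustrated with a plain-Python loop. This sketch deliberately does not use the LangGraph API; it only shows the intended ordering and that nothing after the approval gate runs without approval. `run_pipeline` and the `status` values are hypothetical names.

```python
from typing import Dict, List

# Node order taken from the architecture diagram.
NODE_ORDER: List[str] = [
    "plan", "preflight", "baseline", "approve",
    "inject", "observe", "recover", "analyze", "report",
]


def run_pipeline(approved: bool) -> Dict[str, object]:
    """Walk the nodes in order, stopping at the approval gate if needed.

    Each node is a no-op here; the sketch records which nodes were reached.
    """
    visited: List[str] = []
    for node in NODE_ORDER:
        if node == "approve" and not approved:
            # Nothing past this point (including fault injection) may run.
            return {"visited": visited, "status": "awaiting_approval"}
        visited.append(node)
    return {"visited": visited, "status": "completed"}


pending = run_pipeline(approved=False)
done = run_pipeline(approved=True)
```

In the unapproved run, `pending["visited"]` stops after `baseline`, so the read-only steps complete while `inject` is never reached.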
# 1. Create a venv and install (Python 3.11+; uv works well)
uv venv --python python3.11 .venv
source .venv/bin/activate
uv pip install -e ".[dev]"
# 2. Configure
cp .env.example .env
# edit .env to point at your AWS profile / region
# 3. Confirm AWS access (read-only)
python -m agent.tools.fis_tools list-templates
# 4. Run the API
uvicorn api.main:app --reload --port 8787
# 5. Run the frontend
cd frontend && npm install && npm run dev
# open http://localhost:5173

The agent ships as a single container (multi-stage infra/Dockerfile builds the React
SPA and serves it from the FastAPI app). It runs on ECS Fargate behind an ALB, with
DynamoDB for run state and a private S3 bucket for reports. Two equivalent IaC
options are provided; pick one:
| Option | Path | Quickstart |
|---|---|---|
| Terraform | infra/terraform/ | terraform init && terraform apply, then build/push the image |
| AWS CDK | infra/cdk/ | npm install && npx cdk deploy (builds the image as a CDK asset) |
Both create a least-privilege task role from infra/iam-policy.json:
FIS start/stop + read, CloudWatch read, RDS read (for RPO), Bedrock invoke, and scoped
DynamoDB + S3 — no iam:*, no *Delete*, no infra teardown permissions. Each IaC
folder's README has the full sequence and hardening notes (HTTPS/Cognito, private subnets).
This deploys the agent's control plane only. Point it at an AWS account that already has FIS experiment templates + CloudWatch alarms (see Target infra).
The agent is target-agnostic — point it at any AWS account that already has FIS experiment templates and CloudWatch alarms defined.
For development we use a small Terraform target (EC2 web servers, a Lambda order processor, DynamoDB, SQS, four FIS templates, and CloudWatch alarms named <prefix>-*). The default alarm prefix is sre-chaos- — change DEFAULT_TARGET_ALARMS in agent/tools/cloudwatch_tools.py to match your environment.
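To check which alarms a given prefix would match in your account, a small helper along these lines may be useful. It is a sketch, not the contents of agent/tools/cloudwatch_tools.py: the function name and the `ALARM_PREFIX` constant are hypothetical, and the real shape of DEFAULT_TARGET_ALARMS is not shown here. It relies only on the boto3 CloudWatch `describe_alarms` paginator and its `AlarmNamePrefix` parameter.

```python
from typing import Any, List

# Default prefix named in this README; adjust to your environment.
ALARM_PREFIX = "sre-chaos-"


def list_target_alarms(cloudwatch: Any, prefix: str = ALARM_PREFIX) -> List[str]:
    """Return sorted names of metric alarms whose name starts with prefix.

    `cloudwatch` is expected to behave like boto3.client("cloudwatch").
    Composite alarms are ignored in this sketch.
    """
    names: List[str] = []
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate(AlarmNamePrefix=prefix):
        names.extend(alarm["AlarmName"] for alarm in page.get("MetricAlarms", []))
    return sorted(names)
```

Typical use would be `list_target_alarms(boto3.client("cloudwatch"))` with read-only credentials. Passing the client in as an argument keeps the helper easy to test with a stub.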
EKS + Aurora support (Chaos Mesh, AuroraReplicaLag-based RPO) is coded but feature-flagged off; flip ENABLE_CHAOS_MESH / ENABLE_AURORA_RPO in .env once you have an EKS cluster and Aurora cluster to point at.
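As a rough illustration of how such flags might be read, the sketch below parses boolean environment variables. It assumes the values from .env have already been loaded into the process environment, and the set of accepted truthy strings is an assumption rather than the agent's documented behavior; `flag_enabled` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumed truthy spellings; the agent's real parser may accept others.
_TRUTHY = {"1", "true", "yes", "on"}


def flag_enabled(name: str, env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when the named flag is set to a truthy value.

    Missing or empty variables count as disabled, matching the
    "feature-flagged off" default described above.
    """
    source = os.environ if env is None else env
    return source.get(name, "").strip().lower() in _TRUTHY


# Example with an explicit mapping instead of the real environment.
example_env = {"ENABLE_AURORA_RPO": "true", "ENABLE_CHAOS_MESH": "0"}
aurora_rpo = flag_enabled("ENABLE_AURORA_RPO", example_env)
chaos_mesh = flag_enabled("ENABLE_CHAOS_MESH", example_env)
```

Treating anything unrecognized as off is the conservative choice here, since both flags enable code paths that touch additional infrastructure.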
- Mutating AWS calls require user confirmation in the UI (the agent's HITL gate before fis:StartExperiment).
- The agent never calls fis:DeleteExperimentTemplate or any destructive non-FIS API. It only starts/stops experiments and reads telemetry.
- All write operations are logged to .artifacts/audit.log.
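The confirmation and audit rules above could be enforced by a single wrapper around every mutating call. The following is a hedged sketch: `guarded_write`, `ConfirmationRequired`, and the JSON-lines log format are assumptions made for illustration, not the agent's actual implementation.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Callable

AUDIT_LOG = Path(".artifacts/audit.log")  # path named in this README


class ConfirmationRequired(Exception):
    """Raised when a mutating call is attempted without user confirmation."""


def guarded_write(
    action: str,
    call: Callable[..., Any],
    *,
    confirmed: bool,
    log_path: Path = AUDIT_LOG,
    **params: Any,
) -> Any:
    """Run a mutating call only after confirmation, logging it first."""
    if not confirmed:
        raise ConfirmationRequired(f"{action} requires user confirmation")
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "params": params,
    }
    # Append before calling, so an attempt is recorded even if the call fails.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry, default=str) + "\n")
    return call(**params)
```

A caller might wrap the FIS start call as `guarded_write("fis:StartExperiment", fis.start_experiment, confirmed=user_ok, experimentTemplateId=template_id)`, where `fis`, `user_ok`, and `template_id` are placeholders supplied by the surrounding code.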
- skills/chaos-engineering-on-aws/ (from aws-samples/sample-aws-resilience-skill, MIT-0)
- skills/eks-resilience-checker/ (same source)
- skills/aws-resilience-modeling/ (same source)
- skills/mastering-langgraph/ (from spillwavesolutions/mastering-langgraph-agent-skill, MIT)
These are loaded as system-prompt context per LangGraph node so the agent's planning, pre-flight checks, and analysis follow established patterns instead of being re-invented.
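One way to picture per-node skill loading is a lookup from node name to skill folders, with each skill's text appended to a base system prompt. Everything specific in this sketch is an assumption: the node-to-skill mapping, the `SKILL.md` file name inside each folder, and the `build_system_prompt` helper are hypothetical and may not match the repository.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical mapping of LangGraph node names to skill folders.
NODE_SKILLS: Dict[str, List[str]] = {
    "plan": ["chaos-engineering-on-aws", "aws-resilience-modeling"],
    "preflight": ["eks-resilience-checker"],
    "analyze": ["aws-resilience-modeling"],
}


def build_system_prompt(
    node: str, base_prompt: str, skills_root: Path = Path("skills")
) -> str:
    """Append the text of each skill mapped to `node` to the base prompt.

    Assumes each skill folder holds a SKILL.md file; missing files are
    skipped silently so a partial checkout still yields a usable prompt.
    """
    parts = [base_prompt]
    for skill in NODE_SKILLS.get(node, []):
        skill_file = skills_root / skill / "SKILL.md"
        if skill_file.is_file():
            text = skill_file.read_text(encoding="utf-8").strip()
            parts.append(f"## Skill: {skill}\n{text}")
    return "\n\n".join(parts)
```

Keeping the mapping per node means a polling node can run with a short prompt while planning and analysis nodes receive the fuller guidance.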