1ne/chaos-agent

chaos-agent

LangGraph chaos-engineering agent. Plans, runs, and reports on AWS Fault Injection Service experiments — measuring RTO, cascading-failure paths, and (when an Aurora target is wired up) RPO. A small React frontend lets users trigger experiments and review reports.

It's the active counterpart to a reactive SRE/incident-investigation agent: one causes and measures failures, the other investigates incidents. Pair them to measure MTTD + MTTR alongside RTO.

What it does today

  • Lists FIS experiment templates in your AWS account (read-only)
  • Captures a CloudWatch baseline before each experiment
  • Triggers an FIS experiment, polls its state, and watches for related CloudWatch alarms
  • Detects cascading failures by correlating cross-service metric breaches
  • Computes RTO from alarm OK-transition timestamps after the experiment ends
  • Generates an HTML report that links to the official PDF report, which FIS itself writes to S3
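As an illustration of the RTO step, here is a minimal sketch of one way to derive a recovery time from alarm state transitions. The record shape, the function name `compute_rto`, and the choice to measure from the experiment's end to the slowest alarm's return to OK are assumptions made for this example, not the agent's actual implementation.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Tuple

# (alarm_name, new_state, timestamp): a simplified, hypothetical record shape.
Transition = Tuple[str, str, datetime]


def compute_rto(experiment_end: datetime,
                transitions: Iterable[Transition]) -> Optional[timedelta]:
    """Time from experiment end until every breached alarm is back to OK.

    Returns None if an alarm that entered ALARM has no later OK transition,
    and zero if no alarm breached at all.
    """
    last_alarm = {}  # alarm name -> latest ALARM timestamp
    last_ok = {}     # alarm name -> latest OK timestamp
    for name, state, ts in transitions:
        if state == "ALARM":
            last_alarm[name] = max(ts, last_alarm.get(name, ts))
        elif state == "OK":
            last_ok[name] = max(ts, last_ok.get(name, ts))

    if not last_alarm:
        return timedelta(0)

    recovery_times = []
    for name, alarm_ts in last_alarm.items():
        ok_ts = last_ok.get(name)
        if ok_ts is None or ok_ts <= alarm_ts:
            return None  # this alarm has not recovered yet
        recovery_times.append(ok_ts)

    # Recovery completes when the slowest alarm returns to OK.
    recovered_at = max(recovery_times)
    # Alarms that cleared before the experiment ended count as zero RTO.
    return max(recovered_at - experiment_end, timedelta(0))


def at(minute: int) -> datetime:
    return datetime(2024, 1, 1, 12, minute, tzinfo=timezone.utc)


events = [
    ("api-5xx", "ALARM", at(2)),
    ("queue-depth", "ALARM", at(4)),
    ("api-5xx", "OK", at(13)),
    ("queue-depth", "OK", at(17)),
]
print(compute_rto(at(10), events))  # prints 0:07:00
```

Returning `None` for an unrecovered alarm keeps "still degraded" distinct from "recovered instantly", which matters when the result feeds a report.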

Architecture

LangGraph state machine over Bedrock (Claude Opus 4.7 for planning/analysis, Haiku 4.5 for poll loops). FastAPI service exposes REST + SSE under /api; the React+Vite frontend consumes them. State is checkpointed to SQLite locally and to DynamoDB when deployed (STORE_BACKEND); reports render to local disk and optionally to S3 (REPORT_BUCKET). The same container runs locally and on ECS Fargate behind an ALB — see Deploy to AWS.

flowchart LR
    User([SRE]) --> UI[React UI]
    UI <--> API[FastAPI<br/>REST + SSE]
    API --> LG[LangGraph<br/>plan → preflight → baseline → APPROVE<br/>→ inject → observe → recover → analyze → report]
    LG -. invokeModel .-> Bedrock[Bedrock<br/>Opus 4.7 + Haiku 4.5]
    LG --> FIS[AWS FIS]
    LG --> CW[CloudWatch]
    LG -. RPO .-> RDS[Aurora]
    FIS -. PDF .-> S3[(S3)]
    FIS --> Target[Target under test]
    Target --> CW
    API <--> Store[(State store<br/>SQLite → DynamoDB)]

See docs/architecture.md and docs/architecture.drawio for the full design (Mermaid and drawio diagrams included). The customer-facing pitch is in docs/pitch.md.
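To make the stage ordering in the diagram concrete, the following is a library-free sketch of a linear pipeline that pauses at the approval step and can be resumed later. It deliberately does not use LangGraph; `RunState`, `advance`, and the handler mapping are hypothetical names for this example, and only the stage names come from the diagram.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage names copied from the diagram; "approve" is the human gate.
STAGES = ["plan", "preflight", "baseline", "approve",
          "inject", "observe", "recover", "analyze", "report"]


@dataclass
class RunState:
    approved: bool = False
    next_stage: int = 0  # index into STAGES; doubles as the checkpoint
    history: List[str] = field(default_factory=list)


def advance(state: RunState,
            handlers: Dict[str, Callable[[RunState], None]]) -> RunState:
    """Run stages in order, pausing at 'approve' until state.approved is set."""
    while state.next_stage < len(STAGES):
        stage = STAGES[state.next_stage]
        if stage == "approve" and not state.approved:
            return state  # paused: nothing after this point has run
        handlers.get(stage, lambda s: None)(state)  # no-op if no handler
        state.history.append(stage)
        state.next_stage += 1
    return state


state = advance(RunState(), {})
print(state.history)      # ['plan', 'preflight', 'baseline']
state.approved = True     # e.g. set when the user confirms in the UI
state = advance(state, {})
print(state.history[-1])  # report
```

Because `next_stage` is part of the state, persisting `RunState` between calls is enough to resume a paused run, which is the role the checkpoint store plays in the architecture described above.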

Quickstart (local)

# 1. Create a venv and install (Python 3.11+; uv works well)
uv venv --python python3.11 .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# 2. Configure
cp .env.example .env
# edit .env to point at your AWS profile / region

# 3. Confirm AWS access (read-only)
python -m agent.tools.fis_tools list-templates

# 4. Run the API
uvicorn api.main:app --reload --port 8787

# 5. Run the frontend
cd frontend && npm install && npm run dev
# open http://localhost:5173

Deploy to AWS

The agent ships as a single container (multi-stage infra/Dockerfile builds the React SPA and serves it from the FastAPI app). It runs on ECS Fargate behind an ALB, with DynamoDB for run state and a private S3 bucket for reports. Two equivalent IaC options are provided — pick one:

| Path | Quickstart |
| --- | --- |
| Terraform | infra/terraform/: terraform init && terraform apply, then build/push the image |
| AWS CDK | infra/cdk/: npm install && npx cdk deploy (builds the image as a CDK asset) |

Both create a least-privilege task role from infra/iam-policy.json: FIS start/stop + read, CloudWatch read, RDS read (for RPO), Bedrock invoke, and scoped DynamoDB + S3 — no iam:*, no *Delete*, no infra teardown permissions. Each IaC folder's README has the full sequence and hardening notes (HTTPS/Cognito, private subnets).
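One way to keep such a policy honest is to lint it in CI. The sketch below scans an IAM policy document for allowed actions that match the forbidden patterns named above. The function name and the sample policies are hypothetical, and "no iam:*" is read strictly here as "no IAM actions at all"; the real contents of infra/iam-policy.json are not reproduced.

```python
import fnmatch
from typing import Dict, List

# Patterns the README says must never appear in the task role.
FORBIDDEN_PATTERNS = ["iam:*", "*Delete*"]


def forbidden_actions(policy: Dict) -> List[str]:
    """Return every Allow-ed action that matches a forbidden pattern."""
    hits = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM permits a single statement object
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # ...and a single action string
            actions = [actions]
        for action in actions:
            lowered = action.lower()  # IAM action names are case-insensitive
            if action == "*" or any(
                fnmatch.fnmatchcase(lowered, pattern.lower())
                for pattern in FORBIDDEN_PATTERNS
            ):
                hits.append(action)
    return hits


# Hypothetical sample policy, for illustration only.
sample = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["fis:StartExperiment", "fis:StopExperiment",
                   "cloudwatch:DescribeAlarms", "bedrock:InvokeModel"],
        "Resource": "*",
    }],
}
print(forbidden_actions(sample))  # []
```

This check is intentionally shallow: apart from a bare `*`, it does not expand service-level wildcards such as `dynamodb:*`, which would also grant delete actions, so a stricter lint may want to reject wildcards outright.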

This deploys the agent's control plane only. Point it at an AWS account that already has FIS experiment templates + CloudWatch alarms (see Target infra).

Target infra

The agent is target-agnostic — point it at any AWS account that already has FIS experiment templates and CloudWatch alarms defined.

For development we use a small Terraform target (EC2 web servers, a Lambda order processor, DynamoDB, SQS, four FIS templates, and CloudWatch alarms that share a common name prefix). The default alarm prefix is sre-chaos-; to match your environment, change DEFAULT_TARGET_ALARMS in agent/tools/cloudwatch_tools.py.

EKS + Aurora support (Chaos Mesh, AuroraReplicaLag-based RPO) is coded but feature-flagged off; flip ENABLE_CHAOS_MESH / ENABLE_AURORA_RPO in .env once you have an EKS cluster and Aurora cluster to point at.
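For readers unfamiliar with lag-based RPO, here is a rough sketch of one plausible reading: take the worst `AuroraReplicaLag` value observed during the experiment window as an upper bound on the data at risk. CloudWatch reports this metric in milliseconds; the function name, the sample shape, and the "maximum over the window" rule are assumptions for this example rather than the agent's confirmed method.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional, Tuple

# (timestamp, AuroraReplicaLag in milliseconds): a hypothetical sample shape.
Sample = Tuple[datetime, float]


def estimate_rpo_seconds(samples: Iterable[Sample],
                         window_start: datetime,
                         window_end: datetime) -> Optional[float]:
    """Worst replica lag inside the window, in seconds; None if no samples."""
    in_window = [lag_ms for ts, lag_ms in samples
                 if window_start <= ts <= window_end]
    if not in_window:
        return None
    return max(in_window) / 1000.0


def minute(m: int) -> datetime:
    return datetime(2024, 1, 1, 12, m, tzinfo=timezone.utc)


lag = [(minute(0), 20.0), (minute(5), 1500.0), (minute(6), 300.0),
       (minute(30), 9000.0)]  # the last sample falls outside the window
print(estimate_rpo_seconds(lag, minute(1), minute(10)))  # 1.5
```

Returning `None` when the window contains no samples avoids reporting a misleading RPO of zero when the metric was simply unavailable.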

Safety

  • Mutating AWS calls require user confirmation in the UI: a human-in-the-loop (HITL) gate, the APPROVE step in the graph, sits before fis:StartExperiment.
  • The agent never calls fis:DeleteExperimentTemplate or any destructive non-FIS API. It only starts/stops experiments and reads telemetry.
  • All write operations are logged to .artifacts/audit.log.
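The rules above can be sketched as a single guard that every mutating call passes through. This is an illustrative sketch only: `guarded_call`, the allow-list, and the JSON-lines log format are assumptions, and for simplicity it requires confirmation for every mutating action, whereas a real agent may want to let a stop request through without it.

```python
import json
import os
from datetime import datetime, timezone
from typing import Any, Callable

# The only mutating actions the README says the agent performs.
ALLOWED_MUTATIONS = {"fis:StartExperiment", "fis:StopExperiment"}


def guarded_call(action: str, confirmed: bool, audit_path: str,
                 fn: Callable[[], Any]) -> Any:
    """Run fn only for an allow-listed, user-confirmed action; log every attempt."""
    allowed = action in ALLOWED_MUTATIONS and confirmed
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "confirmed": confirmed,
        "allowed": allowed,
    }
    os.makedirs(os.path.dirname(audit_path) or ".", exist_ok=True)
    with open(audit_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")  # one JSON object per line
    if not allowed:
        raise PermissionError(f"{action} blocked (confirmed={confirmed})")
    return fn()
```

A caller would wrap the real AWS call in the `fn` argument, for example `guarded_call("fis:StartExperiment", user_confirmed, ".artifacts/audit.log", start_fn)`, where `user_confirmed` and `start_fn` are placeholders. Writing the audit entry before the permission check means blocked attempts are recorded as well as successful ones.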

Credits / vendored skills

The vendored skills are loaded as system-prompt context per LangGraph node, so the agent's planning, pre-flight checks, and analysis follow established patterns rather than being reinvented.
