Autonomous SRE agent for AWS. Detects anomalies, investigates incidents, and proposes remediation with continuous learning.
Internet → CloudFront (HTTPS)
│
├── / (frontend) → S3 OAC (private bucket)
│
└── /api/* → VPC Origin ENI → Private ALB → API Handler Lambda
(no public IP) (port 80)
CloudWatch Alarm → EventBridge → Orchestrator Lambda → Step Functions
│
┌───────────────────────┼───────────────┐
│ │ │
Agent Loop Store Report Send Notification
(Bedrock Converse API) (S3 + pgvector) (Slack/PagerDuty/SNS)
│
┌─────────┼──────────┐
│ │ │
CloudWatch X-Ray Logs Insights
Metrics Traces Queries
- Incident Triage: CloudWatch alarm → auto-investigate → remediate → notify
- Proactive Prevention: Scheduled daily/weekly analysis of trends and capacity
- On-Demand Queries: Natural language SRE queries via dashboard or Slack
| Component | Technology |
|---|---|
| Agent Runtime | Bedrock Converse API with native tool_use (Claude Haiku/Sonnet/Opus) |
| Knowledge Base | Aurora Serverless v2 + pgvector (Titan Embeddings v2) |
| Orchestration | AWS Step Functions |
| Frontend | React + Vite + TypeScript, Apple design system |
| Hosting | CloudFront + S3 (OAC) + ALB VPC Origins |
| Auth | Amazon Cognito |
| IaC | AWS CDK (primary) + Terraform (alternative) |
| Integrations | Slack, PagerDuty, Datadog, GitHub (MCP tool registry) |
| Function | Purpose |
|---|---|
| orchestrator | Event classifier + Step Functions trigger |
| agent-loop | Core reasoning engine (Converse API + tool_use cycling) |
| api-handler | REST API for frontend dashboard |
| vector-search | pgvector semantic search + storage |
| tool-dispatcher | Routes tool calls via DynamoDB registry |
| metrics-retrieval | CloudWatch + X-Ray parallel queries |
| log-analysis | CloudWatch Logs Insights |
| remediation | Structured remediation via Converse API |
| notification | Routes to Slack/PagerDuty/SNS by severity |
| proactive-analyzer | Scheduled trend analysis |
| slack-handler | Slack events/commands |
| mcp-tools/datadog | Datadog API integration |
| mcp-tools/pagerduty | PagerDuty API integration |
| mcp-tools/slack | Slack API integration |
| mcp-tools/github | GitHub API integration |
npm install
cd frontend && npm install && npm run build && cd ..
npm run build
npx cdk deployaws cognito-idp admin-create-user \
--user-pool-id <UserPoolId> \
--username user@example.com \
--user-attributes Name=email,Value=user@example.com Name=email_verified,Value=true
aws cognito-idp admin-set-user-password \
--user-pool-id <UserPoolId> \
--username user@example.com \
--password "YourPassword" --permanentcd terraform
cp terraform.tfvars.example terraform.tfvars # fill in values
terraform init && terraform applyThe chaos/fault-injection infrastructure used to generate alarms for the agent to investigate lives in a separate repository and is deployed independently.
- CloudFront is the sole entry point (VPC Origins)
- ALB in private subnet, no public IP
- S3 blocked from public access (OAC)
- Cognito authentication on all API routes
- Secrets in AWS Secrets Manager
~$100-150/mo baseline (Aurora Serverless v2 + NAT Gateway + CloudFront).