SwarmOps is an advanced, multi-agent AI incident response and root-cause analysis platform. It uses a swarm of specialized AI agents that automatically investigate backend incidents, analyze system metrics, diagnose code-level bugs, and generate deployable Git patches in real time.
🚀 Live Demo: https://swarmops-1.onrender.com
In traditional software engineering, Mean Time To Resolution (MTTR) for critical production incidents can range from hours to days. When an alert fires, a human Site Reliability Engineer (SRE) or backend developer must manually:
- Dig through thousands of lines of logs in Datadog or ELK.
- Cross-reference those logs with CPU/Memory metrics in Prometheus or Grafana.
- Trace the request path using Jaeger or Zipkin.
- Clone the repository and hunt for the exact line of code causing the bug.
- Write a patch, test it, and submit a Pull Request.
SwarmOps reduces this entire process to minutes.
By orchestrating a swarm of specialized, concurrent AI agents, SwarmOps works like a senior SRE team. It parallelizes the investigation, reading logs, metrics, traces, and code at the same time. It then synthesizes the findings into a single, high-confidence root cause and automatically writes the git patch to fix it. This is not just a chatbot; it is an autonomous incident remediation engine designed to cut downtime and engineering burnout.
When a production system throws an alert or an incident is reported, SwarmOps automates the investigation with a staged AI swarm pipeline. Triage runs first, the four analysis agents (logs, metrics, traces, security) then run concurrently, and their findings feed root-cause synthesis, fix generation, and validation in sequence:
- Triage Agent: Initially classifies the incident and determines the optimal investigation strategy.
- Log Analyzer Agent: Scans application logs for exceptions, stack traces, and anomalies.
- Metrics Agent: Analyzes CPU, Memory, and application latency metrics.
- Trace Agent: Investigates distributed traces to find network bottlenecks or database deadlocks.
- Security Agent: Checks if the incident is a result of a vulnerability (e.g., prompt injection, IDOR) or malicious attack.
- Root Cause Agent: Synthesizes all findings from the previous agents into a definitive root cause analysis.
- Fix Generator Agent: Writes the actual code to fix the bug (generates a `git` patch).
- Validation Agent: Ensures the proposed fix is logical, secure, and passes basic sanity checks.
As the agents work, they stream their progress in real time over WebSockets to the SwarmOps React Dashboard, giving the user a live, transparent view of the AI "thinking" through the problem.
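The staged flow described above can be sketched with plain `asyncio`. This is a minimal illustration, not the project's Orchestrator: `run_agent`, `investigate`, `demo`, and the `emit` callback are hypothetical names, and each agent is a stub where a real one would call an LLM and push events over Socket.IO.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical callback type: receives (agent_name, message) progress events.
Emit = Callable[[str, str], Awaitable[None]]


async def run_agent(name: str, context: dict, emit: Emit) -> dict:
    """Stand-in for one agent; a real agent would await an LLM call here."""
    await emit(name, "started")
    await asyncio.sleep(0)  # yield control, as an awaited network call would
    await emit(name, "finished")
    return {"agent": name, "finding": f"{name} finding for {context['description']}"}


async def investigate(description: str, emit: Emit) -> dict:
    context: dict = {"description": description}
    # Stage 1: triage runs alone and decides the strategy.
    context["triage"] = await run_agent("triage", context, emit)
    # Stage 2: the four analysis agents run concurrently.
    analysts = ["logs", "metrics", "traces", "security"]
    findings = await asyncio.gather(*(run_agent(n, context, emit) for n in analysts))
    context["findings"] = list(findings)
    # Stage 3: synthesis, fix generation, and validation run in order.
    for stage in ("root_cause", "fix_generator", "validation"):
        context[stage] = await run_agent(stage, context, emit)
    return context


async def demo() -> list:
    """Run one investigation and collect the progress events it emits."""
    events = []

    async def emit(agent: str, message: str) -> None:
        events.append((agent, message))  # a server would forward this to clients

    await investigate("500 errors on checkout", emit)
    return events


if __name__ == "__main__":
    for event in asyncio.run(demo()):
        print(event)
```

Passing `emit` in as a parameter keeps the pipeline independent of the transport, so the same code can feed a WebSocket feed in production and a plain list in tests.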
graph TD
%% Define Styles
classDef frontend fill:#0f172a,stroke:#3b82f6,stroke-width:2px,color:#fff;
classDef backend fill:#064e3b,stroke:#10b981,stroke-width:2px,color:#fff;
classDef ai fill:#4c1d95,stroke:#8b5cf6,stroke-width:2px,color:#fff;
classDef external fill:#78350f,stroke:#f59e0b,stroke-width:2px,color:#fff;
classDef storage fill:#0f172a,stroke:#94a3b8,stroke-width:2px,color:#fff;
subgraph Browser Client
UI[React Dashboard UI]:::frontend
WS_Client[Socket.IO Client]:::frontend
LocalStorage[(Browser localStorage)]:::storage
end
subgraph Server [Monolithic FastAPI Docker Container]
subgraph FastAPI [Backend Services]
Static[Static Asset Server]:::backend
API[REST API Router]:::backend
WS_Server[Socket.IO Server]:::backend
DB[(Persistent JSON Database)]:::storage
Orchestrator[AI Orchestrator Engine]:::backend
end
subgraph Swarm [Autonomous Agent Swarm]
Agent1[Triage Agent]:::ai
Agent2[Log Analyzer]:::ai
Agent3[Metrics Agent]:::ai
Agent4[Trace Agent]:::ai
Agent5[Security Agent]:::ai
Agent6[Root Cause Analyst]:::ai
Agent7[Fix Generator]:::ai
Agent8[Validation Agent]:::ai
end
end
subgraph External Dependencies
GitHub[(GitHub Repositories)]:::external
OpenRouter[OpenRouter API / LLMs]:::external
end
%% Authentication Flow
UI -- "1. OAuth Login" --> OpenRouter
OpenRouter -- "2. Returns API Key" --> LocalStorage
%% API Requests
UI -- "3. POST /incident (X-API-Key)" --> API
API -- "Stores Reports" --> DB
%% WebSockets
WS_Client -- "Real-time Live Feed" <--> WS_Server
%% Server Internal Workings
API -- "Triggers Investigation" --> Orchestrator
Orchestrator -- "Clones repo via Git Tool" --> GitHub
%% Orchestrator to Agents (Workflow)
Orchestrator --> Agent1
Agent1 --> Agent2 & Agent3 & Agent4 & Agent5
Agent2 & Agent3 & Agent4 & Agent5 --> Agent6
Agent6 --> Agent7
Agent7 --> Agent8
%% AI Actions
Swarm -- "Sends Context & Prompts" --> OpenRouter
Swarm -- "Emits Real-time Logs" --> WS_Server
Agent7 -- "Auto-Creates PR on Approval" --> GitHub
SwarmOps is built with a meticulously designed architecture that balances high performance with strict security:
- The Single-Container Monolith: Rather than managing complex microservices, SwarmOps compiles the Vite/React frontend and serves it directly through FastAPI's StaticFiles in a single Docker container. The frontend and the Socket.IO server therefore share one origin and one container, which removes cross-service network hops and greatly simplifies deployment.
- Dynamic AI Orchestration: The `Orchestrator` uses asynchronous Python (asyncio) to spin up agents. It dynamically instantiates the `AsyncOpenAI` client on a per-request basis using the API key passed in the HTTP headers. This ensures tenant isolation: one user's API key can never accidentally be used for another user's request.
- Structured Pydantic Validation: All inputs and LLM outputs are strictly validated using Pydantic schemas. This acts as an internal LLM firewall, ensuring the AI agents return valid JSON structures (like `IncidentReport`) and preventing hallucinated fields from crashing the UI.
Using SwarmOps is designed to be as seamless as possible:
- Authenticate: Click "Enter Command Center" on the landing page. You will be redirected to OpenRouter to authenticate. This securely provides SwarmOps with the ability to run AI models on your behalf without requiring you to manually copy/paste API keys.
- Report an Incident: On the Dashboard, click New Investigation. Provide a brief description of the bug (e.g., "Users are reporting 500 errors when clicking the checkout button"), the affected service, and the target GitHub repository URL.
- Watch the Swarm: Sit back as the investigation begins. The Live Feed panel will show real-time WebSocket logs of exactly what each agent is thinking and doing. The Pipeline graph will illuminate as phases are completed.
- Review Findings: Once complete, the system will present a synthesized Root Cause, a Confidence Score, and a detailed Code Patch (`git diff`) proposed by the Fix Generator agent.
- Approve Patch: If the fix looks correct, click Approve & Deploy Fix. SwarmOps will automatically apply the patch and generate a Pull Request to your repository.
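The "Report an Incident" step can also be driven from a script. The sketch below builds the request the architecture diagram describes (`POST /incident` with an `X-API-Key` header) using only the standard library. The JSON field names and the `build_incident_request` helper are assumptions for illustration; the real payload shape may differ.

```python
import json
import urllib.request


def build_incident_request(base_url: str, api_key: str, description: str,
                           service: str, repo_url: str) -> urllib.request.Request:
    """Build (but do not send) an incident report request."""
    # Payload field names are assumptions for illustration only.
    payload = {"description": description, "service": service, "repo_url": repo_url}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/incident",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
```

To actually submit the report against a running instance, pass the returned object to `urllib.request.urlopen`.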
SwarmOps is built with a Privacy First mindset:
- No Personal Data Stored: We do not track users or store personal emails/passwords.
- Secure API Key Management: Your OpenRouter API key is securely stored in your browser's `localStorage`. The backend only receives the key via headers on a per-request basis and never saves it to our database.
- Persistent Analytics: Incident reports and generated patches are permanently stored in a lightweight JSON database (`data/incidents.json`). This powers the Dashboard Analytics (allowing you to track your team's efficiency over time) without requiring complex external database connections.
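A file-backed store like this can be sketched in a few lines. The helper below is a hypothetical illustration, not the project's storage code: `append_incident` is an assumed name, and the record is whatever dictionary the caller passes (which, per the policy above, should never include the API key).

```python
import json
import os
import tempfile
from pathlib import Path


def append_incident(db_path: Path, report: dict) -> int:
    """Append one report to the JSON file and return the new record count."""
    db_path.parent.mkdir(parents=True, exist_ok=True)
    records = json.loads(db_path.read_text(encoding="utf-8")) if db_path.exists() else []
    records.append(report)
    # Write to a temp file in the same directory, then swap it in atomically,
    # so a crash mid-write cannot leave a truncated incidents file.
    fd, tmp_name = tempfile.mkstemp(dir=db_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(records, tmp, indent=2)
    os.replace(tmp_name, db_path)
    return len(records)
```

Rewriting the whole file on each append is simple and adequate for small datasets; concurrent writers would additionally need a lock around the read-modify-write step.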
- Python 3.11+
- FastAPI: High-performance async web framework.
- Socket.IO: Real-time event streaming (`python-socketio`).
- Pydantic: Strict data validation.
- OpenRouter: The LLM engine powering the agents (dynamic client initialization per request).
- React + Vite: Blazing fast frontend build tool.
- Tailwind CSS: Utility-first styling with custom Dark Cosmic Glassmorphism UI.
- Framer Motion: Smooth, high-performance animations and 3D parallax effects.
cd backend
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

cd frontend
npm install
npm run build

The FastAPI backend is configured to serve the compiled React frontend automatically.

cd backend
uvicorn main:socket_app --host 0.0.0.0 --port 8000 --reload

Open http://localhost:8000 in your browser.
Note: You do not need a .env file! Just click "Sign In with OpenRouter" on the local frontend to authenticate.
SwarmOps is configured as a Single-Container Monolith using a multi-stage Dockerfile. This means the React frontend and Python backend are built and hosted together in the exact same container.
To deploy on Render:
- Create a new Web Service.
- Connect your GitHub repository.
- Set the Root Directory to `backend`.
- Set the Dockerfile Path to `Dockerfile`.
- No environment variables are required for deployment since it uses OAuth!
Render will automatically run the multi-stage build (compiling the React app and installing the Python dependencies) and serve the unified application.
Contributions, issues, and feature requests are welcome! Feel free to check the issues page.