AI-powered SRE platform for automated incident investigation
Updated Mar 3, 2026 · Python
An autonomous SRE agent that monitors cloud logs across multiple platforms, leveraging AI models from various providers to detect anomalies, perform root cause analysis, and automate remediation by creating GitHub Pull Requests.
ARF is an agentic reliability intelligence platform that separates decision intelligence (OSS) from governed execution (Enterprise), enabling autonomous operations with deterministic safety guarantees.
AI-powered open-source monitoring platform with auto-remediation: six built-in runbooks, MCP integration (described by the project as a global first), DeepSeek-based root cause analysis, and a 5-minute Docker setup.
SDK to track cost-per-outcome for AI workflows
Open source code for AIOpsServing
Moso Bamboo (Moso): an SSH-based remote AI operations and maintenance system. It is built on FastAPI and a native JavaScript SPA, and integrates an AI assistant, host management, a WebSocket terminal, MCP, Skills, batch operations, and more.
Multi-Agent OS for Claude Code — 134 agents, 15 categories, local-first routing, 14 plugins, cost-aware orchestration
An AI-powered DevOps tool that analyzes Linux server logs to detect anomalies and predict failures. It combines ML models, automated fixes via Ansible, containerization with Docker, and orchestration with Kubernetes to provide a full-stack solution for predictive maintenance.
[In Progress] Self-Improving AI Agent for Observability Incident Response
Your AI agents are burning money. AImeter shows you exactly how much.
🚀 Enhance Google Cloud operations with the Gemini SRE Agent, automating log monitoring and incident response for smarter site reliability.
🤖 Build and deploy scalable Multi-AI Agent systems with LangGraph and Groq LLMs to enhance intelligence across enterprise applications.
Lightweight server monitoring and AI Ops workflow for Linux (Nginx/Apache). Features a real-time performance dashboard, intelligent log parsing, multi-channel smart alerts, and LLM-driven fault diagnosis.
Standalone, pluggable observability stack with a read-only MCP interface — one pane of glass across your environments, queryable by both Grafana and LLMs.
Modular, AI-powered IT Operations Platform
The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.
Measure your unattended Claude / claude -p / Agent SDK spend before the 2026-06-15 Anthropic billing change. Free, MIT-licensed, no exotic deps. Companion site: wipf.com/headless-claude-billing-prep
ReliaKit TL-15 is an open-source, planet-grade resilience framework for distributed infrastructure. It integrates automated DDoS protection, geo-aware routing, chaos engineering, and symbolic AI hooks to achieve fault tolerance beyond traditional benchmarks.
An AI Agent IaC tool that aims to make developing and deploying AI Agents easier.