incidentfox / incidentfox

AI-powered SRE platform for automated incident investigation

devops cloud on-call observability incident-management ai-ops ai-sre

Updated Mar 3, 2026
Python

avivl / cloud-sre-agent

An autonomous SRE agent that monitors cloud logs across multiple platforms, leveraging AI models from various providers to detect anomalies, perform root cause analysis, and automate remediation by creating GitHub Pull Requests.

python aws devops automation cloud log-analysis incident-response gcp google-cloud sre resilience log-monitoring ai-agents platform-engineering vertex-ai llm ai-ops gemini-ai

Updated Feb 20, 2026
Python

petterjuan / agentic-reliability-framework

ARF is an agentic reliability intelligence platform that separates decision intelligence (OSS) from governed execution (Enterprise), enabling autonomous operations with deterministic safety guarantees.

devops reliability-engineering python-library sre observability incident-management self-healing anomaly-detection ai-agents autonomous-systems graph-memory production-monitoring mlops-workflow ai-ops observability-platform ai-infrastructure self-healing-infrastructure

Updated Mar 2, 2026
Python

LinChuang2008 / nightmend

AI-powered open-source monitoring platform with auto-remediation: 6 built-in runbooks, MCP integration (billed as a world first), and DeepSeek root cause analysis. 5-minute Docker setup.

react docker open-source devops ai monitoring mcp incident-response alerting self-hosted infrastructure-monitoring observability self-healing auto-remediation aiops fastapi ai-ops

Updated Apr 24, 2026
Python

botanu-ai / botanu-sdk-python

SDK to track cost-per-outcome for AI workflows

machine-learning tracing observability cost-optimization finops roi-analysis opentelemetry opentelemetry-python enterprise-solutions llm ai-ops outcomes-analytics genai genai-usecase cloud-cost-efficiency

Updated Apr 25, 2026
Python

alibaba / AIOpsServing

Open source code for AIOpsServing

machine-learning model-serving model-benchmarking ai-ops mlflow-compatible alicloud-compatible

Updated Mar 23, 2023
Python

messageloop2025 / Moso

Moso Bamboo (Moso) is an SSH-based remote AI operations and maintenance system. It is built on FastAPI and a native JS SPA, and integrates an AI assistant, host management, a WebSocket terminal, the MCP protocol, Skills, batch operations, and more.

ssh devops mcp web-terminal ai-ops

Updated Jun 11, 2026
Python

SkyWalker2506 / claude-config

Multi-Agent OS for Claude Code — 134 agents, 15 categories, local-first routing, 14 plugins, cost-aware orchestration

Updated Jun 1, 2026
Python

AmSh4 / Logs_Guard-AI

An AI-powered DevOps tool that analyzes Linux server logs to detect anomalies and predict failures. It combines ML models, automated fixes via Ansible, containerization with Docker, and orchestration with Kubernetes, providing a full-stack solution for predictive maintenance.

react docker kubernetes flask ansible devops machine-learning log-analysis deep-learning ci-cd full-stack system-monitoring anomaly-detection predictive-maintenance ai-ops

Updated Aug 21, 2025
Python

telemetryflow / telemetryflow-hermes

[In Progress] Self-Improving AI Agent for Observability Incident Response

open-source devops telemetry hermes devops-tools observability otel it-ops opentelemetry otlp ai-observability devopscorner ai-ops observability-platform telemetryflow hermes-agent telemetryflow-platform

Updated Jun 9, 2026
Python

saileshr / aimeter

Your AI agents are burning money. AImeter shows you exactly how much.

openai observability ai-agents cost-optimization ai-ops langchain llm-cost agent-billing

Updated Apr 19, 2026
Python

khael-kun-cmd / gemini-sre-agent

🚀 Enhance Google Cloud operations with the Gemini SRE Agent, automating log monitoring and incident response for smarter site reliability.

python devops automation incident-response google-cloud sre resilience log-monitoring vertex-ai ai-ops gemini-ai

Updated Jun 11, 2026
Python

omri3193 / Enterprise-Multi-AI-Agent-Systems-

🤖 Build and deploy scalable Multi-AI Agent systems with LangGraph and Groq LLMs to enhance intelligence across enterprise applications.

reliability-engineering python-library swarm self-healing codex autonomous-agents swarm-intelligence anomaly-detection autonomous-systems ai-ops observability-platform agentic-framework agentic-workflow agentic-ai mcp-server claude-code agentic-engineering self-healing-infrastructure

Updated Jun 12, 2026
Python

tankeito / server-mate

Lightweight server monitoring & AI Ops workflow for Linux (Nginx/Apache). Features a real-time performance dashboard, intelligent log parsing, multi-channel smart alerts, and LLM-driven fault diagnosis.

python nginx devops automation log-analysis dashboard apache server-monitoring ai-ops

Updated Apr 26, 2026
Python

sidkos / panoptes

Standalone, pluggable observability stack with a read-only MCP interface — one pane of glass across your environments, queryable by both Grafana and LLMs.

python kubernetes devops monitoring mcp terraform grafana prometheus sre observability opentelemetry victoriametrics ai-ops model-context-protocol

Updated Jun 7, 2026
Python

natorus87 / ninko

Modular, AI-powered IT Operations Platform

agent docker kubernetes ai self-hosted assistant opnsense proxmox homelab infrastructure-automation llm ai-ops open-llm local-ai ollama lm-stu agentic-workflow agentic-ai

Updated Jun 11, 2026
Python

bluet / arguslm

The hundred-eyed watcher for your LLM providers. Monitor uptime, TTFT, TPS, and latency across OpenAI, Anthropic, Azure, Bedrock, Ollama, LM Studio, and 100+ providers through a single dashboard. Benchmark, compare, and get alerts — all self-hosted.

Updated Jun 10, 2026
Python

frothlick / claude-headless-meter

Measure your unattended Claude / claude -p / Agent SDK spend before the 2026-06-15 Anthropic billing change. Free, MIT-licensed, no exotic deps. Companion site: wipf.com/headless-claude-billing-prep

developer-tools claude cost-monitoring ai-ops anthropic claude-code agent-sdk llm-tooling

Updated May 31, 2026
Python

zebadiee / reliakit-tl15

ReliaKit TL-15 is an open-source, planet-grade resilience framework for distributed infrastructure. It integrates automated DDoS protection, geo-aware routing, chaos engineering, and symbolic AI hooks to achieve fault tolerance beyond traditional benchmarks.

kubernetes devops infrastructure-as-code observability resilience chaos-engineering ai-ops

Updated Jan 22, 2026
Python

willwoodward / woodwork-engine

An AI Agent IaC tool that aims to make developing and deploying AI Agents easier.

python ai python-package config-language mlops ai-agent llm ai-ops llm-ops ai-agent-framework

Updated May 7, 2026
Python

ai-ops

Here are 48 public repositories matching this topic...
