agamm/awesome-ai-sre

Awesome AI SRE

Applying artificial intelligence to site reliability engineering — autonomous incident response, intelligent observability, and self-healing infrastructure.

AI SRE Agents

Autonomous AI agents purpose-built for SRE workflows — investigating alerts, performing root cause analysis, and resolving incidents with minimal human intervention.

  • Resolve AI - Autonomous SRE platform by OpenTelemetry co-creators that targets an 80% autonomous resolution rate with parallel hypothesis investigation.
  • Middleware OpsAI - AI SRE agent that detects issues across APM, RUM, Logs, and Kubernetes, traces errors to the exact line of code via GitHub MCP, and opens a PR with a fix or auto-applies it for Kubernetes without waking your on-call engineer.
  • Cleric - Autonomous AI SRE teammate that investigates alerts 24/7 and delivers root cause analysis in Slack.
  • NeuBird - Agentic AI SRE co-pilot for enterprise IT with LLM-powered telemetry analysis and 230K+ alerts resolved.
  • Phoebe AI - Predicts incidents from leading indicators and generates pre-emptive fixes using multi-agent AI swarms.
  • Ciroos AI - Multi-agentic AI SRE teammate built on MCP and A2A architectures for extensible cross-tool orchestration.
  • Dash0 - AI-native observability with specialized agents for on-call triage, PromQL queries, and dashboard automation.
  • Datadog Bits AI - Autonomous AI on-call agent embedded in Datadog that analyzes runbooks and telemetry before responders log in.
  • Harness AI SRE - Human-aware change agent with AI Scribe that captures Slack, Teams, and Zoom signals and correlates them with system changes.
  • Azure SRE Agent - AI agent for monitoring, diagnosing, and resolving issues in Azure-hosted applications with no-code sub-agent builder.
  • Causely - Causal AI engine that determines the single root cause from alert storms using causal reasoning rather than correlation.
  • DrDroid - AI SRE agent with knowledge graph for investigation recommendations, PlayBooks automation, and AlertOps Slack bot.
  • TierZero AI - Autonomous infrastructure issue management that auto-investigates, triages, and resolves infrastructure issues.
  • Kubiya - Agentic engineering platform with natural language Slack and Teams commands, Terraform and CI/CD automation, and role-based access control.
  • SRE.ai - Natural language AI agents for complex enterprise DevOps workflows including CI/CD and testing.
  • Sherlocks.ai - AI-native SRE assistant that automates incident response, root cause analysis, and outage prevention with institutional memory.
  • Parity - AI agent for cloud infrastructure reliability and Kubernetes operations.
  • Beeps - On-call platform that helps developers and agents resolve downtime faster.
  • Kura - AI DevOps copilot for AWS cloud infrastructure management and incident response.
  • Wild Moose - AI first responder for production incidents that investigates and surfaces root cause in under one minute.
  • Nudge Bee - Enterprise AI-agentic workflow platform for SRE and CloudOps with pre-built AI assistants and customizable workflows.
  • Agent SRE - AI agent for autonomous site reliability engineering.
  • Anyshift - AI SRE agent that investigates production incidents by tracing changes across a versioned infrastructure graph to identify root causes.
  • Guardian by Metoro - AI SRE agent for Kubernetes that detects issues, finds the root cause, and opens fix PRs automatically.
  • Hyground - Sovereign AI SRE agent built to operate complex software across your entire stack, automatically find root causes, and cut DevOps toil.

AI Production Debugging

AI-powered tools for debugging production applications in real-time — adding observability without redeployments and autonomously remediating code issues.

  • Lightrun - AI SRE platform for autonomous code remediation that lets you add logs, snapshots, and metrics to production without restarts.
  • Sentry Seer - AI debugging agent built on production telemetry that identifies actionable issues, performs root cause analysis, and generates code fixes.

Incident Management

AI-enhanced platforms for managing the full incident lifecycle — detection, triage, response, communication, and post-mortems.

  • PagerDuty AIOps - Enterprise incident management with ML-based noise reduction, AI Agent Suite with SRE Agent and Copilot, and MCP server integration.
  • incident.io - Slack-native incident management with AI SRE, AI alert triage, AI postmortems, Scribe call transcription, and Claude and Cursor integration.
  • Rootly - AI-native incident management with LLM-powered investigation across the observability stack.
  • FireHydrant - AI-powered incident summaries, Zoom-aware context enrichment, and AI-drafted retrospectives. Being acquired by Freshworks.
  • Squadcast - Incident management with AI-driven alert clustering and automatic grouping of related incidents. Acquired by SolarWinds.
  • Zenduty - On-call and incident management with AI Summarizer, AI Postmortem, and AI Scheduling. Acquired by Xurrent, rebranding to Xurrent IMR.
  • BetterStack - Developer-friendly uptime monitoring and incident management with integrated observability.

Observability Platforms

Full-stack observability with AI capabilities — anomaly detection, natural language querying, and intelligent alerting across metrics, logs, and traces.

  • Datadog - Unified SaaS observability with Watchdog AI auto-detection, predictive metrics monitoring, and LLM observability across 600+ integrations.
  • Dynatrace - Full-stack observability with Davis AI engine for continuous dependency analysis, anomaly detection, and Davis CoPilot for natural language remediation.
  • New Relic - Full-stack observability with NRAI assistant for natural language queries and AI-powered anomaly detection.
  • Grafana - Open source observability with Grafana Assistant for natural language queries, autonomous incident investigation, and ML-based anomaly detection.
  • Splunk - Enterprise observability with AI-driven anomaly detection at scale and ITSI with ML-based predictive analytics. Part of Cisco.
  • Elastic AI Assistant - AI assistant in Kibana for natural language log, metrics, and trace querying with contextual alert triage and RAG-powered knowledge base.
  • Honeycomb - Observability for distributed services with Query Assistant, Honeycomb Intelligence, AI-guided Canvas workspace, and hosted MCP server.
  • Coroot - Open source observability with AI-powered root cause analysis and eBPF-based auto-instrumentation.
  • Last9 - Unified observability with Agentic SRE SDK, AI copilot integration with Claude, Cursor, and Slack, and managed TSDB.
  • SigNoz - Open source OpenTelemetry-native observability platform for logs, metrics, and traces with unified correlation analysis.
  • Middleware - Full-stack observability platform that detects issues across APM, RUM, logs, and infrastructure, and resolves them using OpsAI, an AI SRE agent that pinpoints root cause and auto-fixes issues with a 70% automated resolution rate.
  • Metoro - Kubernetes-native observability platform with built-in eBPF telemetry, AI investigation, deployment verification, and root cause analysis.
  • Radar - Open source Kubernetes observability with topology, service traffic, and event timeline, plus a built-in MCP server and a 31-check best-practices audit for AI assistants.

AIOps Platforms

Platforms that apply ML and AI to IT operations — correlating events, reducing alert noise, and automating operational workflows at scale.

  • BigPanda - AIOps for high-alert-volume environments with event correlation reducing alert volume by 95%+ and AI Incident Assistant.
  • Moogsoft - AIOps with event deduplication, contextual enrichment, intelligent correlation, and automated root cause analysis. Part of Dell Technologies.
  • LogicMonitor - Cloud-based infrastructure monitoring with Edwin AI agent for plain-language summaries, predictive analytics, and capacity forecasting.
  • Selector AI - AI-powered network observability with Network Large Language Model, 90% alert noise reduction, and digital twin modeling.
  • Keep - Open source AIOps and alert management with correlation across monitoring tools and 50+ integrations.

Log Analysis and Anomaly Detection

Specialized tools for AI-driven log analytics, pattern recognition, and automated anomaly detection.

  • Sumo Logic - Cloud-native log analytics with real-time AI-powered anomaly detection and ML-based pattern recognition.
  • Graylog - Open source log management for centralized collection, indexing, and analysis with anomaly alerting.
  • Logz.io - Cloud observability built on ELK and OpenSearch with AI-powered log analysis, ML-based anomaly detection, and MCP server.
  • OpenObserve - Open source high-performance log, metrics, and trace platform with real-time analytics.
  • LogAI - Open source library by Salesforce for log clustering, anomaly detection, and summarization with modular ML pipelines.

Chaos Engineering

Tools for proactively testing system resilience — now enhanced with AI for intelligent experiment design, blast radius control, and automated analysis.

  • ChaosEater - Research tool using LLMs to fully automate the chaos engineering cycle from requirement identification through experiment design, execution, and analysis.
  • Harness Chaos Engineering - Enterprise chaos engineering with LLM-derived test recommendations, intelligent blast radius downscaling, and MCP tool integration.
  • Gremlin - Pioneering commercial chaos engineering tool with attack templates, infrastructure metrics monitoring, and multi-cloud support.
  • Steadybit - Chaos engineering with open source extension framework, resilience policies, and experiment automation.
  • LitmusChaos - CNCF open source chaos engineering for Kubernetes with ChaosHub for shared experiments.
  • Chaos Mesh - CNCF open source Kubernetes-native chaos engineering with comprehensive fault injection for pods, network, IO, time, and kernel.
  • AWS Fault Injection Service - AWS-native chaos engineering with integrated experiment templates and safety controls.

Runbook Automation

AI-powered tools for automating operational runbooks — converting manual procedures into self-executing workflows with intelligent decision-making.

  • Rundeck - Open source and commercial runbook automation with self-service GUI, job scheduling, RBAC, and 1000+ integration plugins. Part of PagerDuty.
  • StackStorm - Open source event-driven automation with rules engine, 6000+ actions, and ChatOps. Used by Netflix for self-healing infrastructure.
  • Ansible Lightspeed - AI-powered Ansible playbook generation via IBM watsonx with natural language to Ansible code and MCP support.
  • RunWhen - Platform for SRE agent orchestration and automated troubleshooting workflows.

Cloud Cost Optimization

AI-driven platforms for optimizing cloud spend — autonomous rightsizing, commitment management, and workload-aware cost allocation.

  • CAST AI - Kubernetes cost optimization with real-time pod rightsizing, autoscaling optimization, predictive capacity forecasting, and advanced bin-packing.
  • Sedai - Autonomous cloud optimization using patented reinforcement learning for rightsizing, workload-aware capacity scaling, and 30-50% cost savings.
  • ProsperOps - Autonomous commitment optimization managing $6B+ annual cloud usage. Acquired by Flexera.
  • Kubecost - Open source Kubernetes cost monitoring with real-time cost allocation and automated rightsizing recommendations.
  • Vantage - Multi-cloud cost management with FinOps Agent for AI-driven savings identification and open source MCP server.
  • nOps - AWS-focused FinOps with AI agent trained on customer data for automated commitment optimization.
  • Finout - Enterprise FinOps with MegaBill for multi-provider cost consolidation and AI-powered cost attribution.
  • Spot.io - Cloud infrastructure automation with spot instance optimization and commitment management. Part of NetApp.
  • CloudPilot AI - Kubernetes-native capacity management with predictive scaling that anticipates usage spikes proactively.

LLM-Powered DevOps Tools

Tools leveraging large language models for natural language interaction with infrastructure, code generation for operations, and AI-assisted DevOps workflows.

  • K8sGPT - CNCF project for AI-powered Kubernetes diagnostics with SRE experience codified into analyzers and multiple LLM backends.
  • HolmesGPT - CNCF Sandbox project providing a 24/7 on-call AI agent with agentic loop querying live observability data from Prometheus, Grafana, Datadog, and Kubernetes.
  • Kube-Copilot - Open source natural language to Kubernetes operations with manifest generation and security scanning.
  • Lens Prism - AI copilot in Lens Desktop for context-aware natural language interaction with live Kubernetes clusters.
  • GitHub Copilot Agent Mode - AI coding assistant with DevOps agent capabilities for infrastructure validation, incident response, and pipeline automation.
  • GitLab Duo - AI throughout the DevSecOps lifecycle with failed job trace analysis, root cause identification, and Security Analyst Agent.
  • Grafana Assistant - AI assistant for natural language dashboard creation, autonomous incident investigation, and query generation.

Agent Benchmarks

Frameworks and benchmarks for evaluating AI SRE agent performance.

  • SRE Bench - Benchmark for evaluating AI SRE agents on realistic operational tasks.

Research Papers

Key academic and industry research on applying AI and ML to site reliability engineering and IT operations.

Blogs and Newsletters

Community Lists

Other curated collections in the AI and operations space.

Contributing

Contributions welcome! Read the contribution guidelines first.

About

A curated list of 100+ AI-powered tools, platforms, and resources for Site Reliability Engineering (SRE) — agents, incident management, observability, AIOps, chaos engineering, and more.
