Blog Posts

Karpenter Monitoring: Spot Savings and Node Pool Cost Breakdown

Karpenter now exposes enough pricing data to estimate Kubernetes node costs directly from Prometheus metrics. The existing Karpenter dashboards in the kubernetes-autoscaling-mixin already cover node pools, instance types, and scaling behavior. This post focuses on the new cost dashboard: estimated monthly cost, spot instance savings, and node pool cost breakdown in Grafana.

This builds on the earlier posts on Karpenter monitoring with Prometheus and Grafana, comprehensive Kubernetes autoscaling monitoring, and Kubernetes cost tracking with OpenCost. If you want actual cost allocation, shared resource accounting, and historical cost analysis, use OpenCost and the opencost-mixin. This Karpenter dashboard answers a narrower question: what does the current Karpenter-managed node fleet look like in dollars?
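
The spot-savings arithmetic behind the dashboard is simple. A minimal sketch with made-up node counts and prices (the dashboard derives the real numbers from Karpenter's Prometheus pricing metrics):

```python
# Illustrative arithmetic only: node counts and hourly prices below are
# invented; the dashboard pulls real values from Karpenter's pricing metrics.
HOURS_PER_MONTH = 730  # common billing convention (365 * 24 / 12)

# hypothetical node pools: (node_count, spot_hourly_price, on_demand_hourly_price)
fleet = {
    "general-purpose": (6, 0.035, 0.096),
    "compute-heavy": (2, 0.110, 0.340),
}

# estimated monthly cost of the current fleet at spot prices
monthly_cost = sum(n * spot * HOURS_PER_MONTH for n, spot, _ in fleet.values())
# what the same fleet would cost on-demand
on_demand_cost = sum(n * od * HOURS_PER_MONTH for n, _, od in fleet.values())

savings = on_demand_cost - monthly_cost
savings_pct = 100 * savings / on_demand_cost
```

This is a point-in-time estimate of the running fleet, which is exactly the narrower question the dashboard answers; it is not cost allocation.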

Django tracing with OpenTelemetry

Getting distributed tracing working in Django usually means stitching together OpenTelemetry SDK setup, Django instrumentation, OTLP export, database and Redis hooks, outbound HTTP tracing, and Celery propagation. This post walks through a general Django tracing setup with OpenTelemetry, what to instrument, how to propagate traces into Celery, and how to add custom span data. At the end, I cover django-o11y, which packages this into a smaller Django-native setup.
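
Much of the propagation machinery, including the hop into Celery, comes down to a single W3C `traceparent` header carried alongside the task. A stdlib-only sketch of its shape, with hypothetical helper names (OpenTelemetry's propagators handle this for you):

```python
import os

# W3C Trace Context header: version(2 hex)-trace_id(32 hex)-span_id(16 hex)-flags(2 hex)
# Helper names here are illustrative, not part of any library.
def make_traceparent(trace_id: bytes, span_id: bytes, sampled: bool = True) -> str:
    flags = "01" if sampled else "00"
    return f"00-{trace_id.hex()}-{span_id.hex()}-{flags}"

def parse_traceparent(header: str) -> dict:
    version, trace_id, span_id, flags = header.split("-")
    return {"trace_id": trace_id, "span_id": span_id, "sampled": flags == "01"}

header = make_traceparent(os.urandom(16), os.urandom(8))
ctx = parse_traceparent(header)
```

When the worker extracts this header and starts its span with the parsed context, the web request and the task show up as one trace.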

Argo Workflows monitoring with Prometheus and Grafana

Argo Workflows exposes enough metrics to see whether workflows are backing up, CronWorkflows are firing, and the controller is keeping up, but raw /metrics output does not make any of that easy to read. This post covers argo-workflows-mixin, a Prometheus and Grafana mixin that adds two dashboards and a focused alert set for Argo Workflows.

The mixin is available on GitHub. It currently ships with two Grafana dashboards and four alert rules.

Managing multiple AI coding sessions with ai-dash

Managing multiple AI coding sessions gets messy fast once you use more than one tool. Sessions end up scattered across local transcript files, JSONL logs, and SQLite databases. Finding an old session, checking which model was used, or jumping back into the right project usually means opening each tool separately and remembering where it stores state. ai-dash is a local terminal UI for multi-session management across Claude Code, Codex, and OpenCode.

Monitoring Go runtime with Prometheus and Grafana

Go applications expose a useful set of runtime metrics, but raw /metrics output does not make it easy to spot GC pressure, scheduler latency, memory growth, or file descriptor exhaustion. This post covers a go-mixin for Prometheus and Grafana that adds a dashboard and alerts for the Go runtime.

The mixin is available on GitHub. The dashboard is also published in the Grafana dashboard library. It currently ships with one Grafana dashboard and three alert rules:

  • Go / Overview - A dashboard for runtime CPU usage, scheduler latency, garbage collection, heap churn, mutex contention, cgo activity, and file descriptor pressure.
  • GoHighGcCpu - Alerts when a Go process spends too much CPU time in garbage collection.
  • GoHighSchedulerLatency - Alerts when runnable goroutines wait too long to be scheduled.
  • GoHighFdUsage - Alerts when a Go process is close to its file descriptor limit.

The repo also includes generated dashboard JSON and Prometheus rule files, so you can either vendor the mixin into your Jsonnet setup or import the generated files directly.
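
The alert expressions are short. A hedged sketch in Prometheus rule YAML using the common client_golang metric names; the mixin's actual expressions and thresholds may differ, and scheduler latency additionally requires the opt-in `go_sched_latencies_seconds` runtime metrics:

```yaml
# Sketch only: thresholds and exact expressions are placeholders.
groups:
  - name: go-runtime
    rules:
      - alert: GoHighGcCpu
        expr: go_memstats_gc_cpu_fraction > 0.1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Go process spending too much CPU time in GC."
      - alert: GoHighFdUsage
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Go process nearing its file descriptor limit."
```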

Introducing django-o11y: traces, logs, metrics, and profiling for Django and Celery

Over the years I've written several blog posts covering different parts of Django observability - Django Monitoring with Prometheus and Grafana, Django Development and Production Logging, Celery Monitoring with Prometheus and Grafana, and Django Error Tracking with Sentry. Each post covers one piece: wiring up django-prometheus, configuring structlog, deploying the Celery exporter, setting up distributed tracing. The problem is that wiring all of it by hand across every project is repetitive and easy to get wrong.

django-o11y bundles those patterns into a single installable package. One DJANGO_O11Y settings dict gets you traces, structured logs, Prometheus metrics, and optional Pyroscope profiling - with all four signals correlated on trace_id.
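
Purely as a flavor sketch of what a single settings dict can look like; the key names below are invented for illustration, so check the django-o11y documentation for the real schema:

```python
# Hypothetical shape only -- these keys are invented for illustration,
# not the actual django-o11y settings schema.
DJANGO_O11Y = {
    "service_name": "myapp",
    "traces": {"exporter": "otlp", "endpoint": "http://otel-collector:4317"},
    "logs": {"json": True},
    "metrics": {"enabled": True},
    "profiling": {"pyroscope_endpoint": "http://pyroscope:4040"},
}
```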

Reducing Google API Egress Costs with a Simple DNS Change

I enabled Private Google Access (PGA) so our private GKE workloads could reach Google APIs without public node IPs. Everything “worked,” but when I dug into billing and flow logs, I kept seeing Google API traffic (notably storage.googleapis.com) classified as PUBLIC_IP connectivity - and I was getting surprisingly high “carrier peering / egress” charges.
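
A common remedy in this situation (the post covers the exact setup) is a private DNS zone that resolves `*.googleapis.com` to the `private.googleapis.com` range, 199.36.153.8/30, so API traffic stays on Google's internal path. As a BIND-style zone fragment:

```text
; Sketch of a private zone for googleapis.com; the post covers the real setup.
private.googleapis.com.  300 IN A     199.36.153.8
private.googleapis.com.  300 IN A     199.36.153.9
private.googleapis.com.  300 IN A     199.36.153.10
private.googleapis.com.  300 IN A     199.36.153.11
*.googleapis.com.        300 IN CNAME private.googleapis.com.
```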

Observability for Headscale: Metrics and Dashboards in Grafana

Headscale is an open source, self-hosted control server compatible with Tailscale clients. It lets you run your own Tailnet and have full control over users, nodes, keys, and routing policies without relying on Tailscale’s hosted control plane. This post introduces the tailscale-exporter, shows how to collect Headscale metrics via the Headscale gRPC API, and visualizes everything in Grafana using dashboards and alerts bundled in the mixin.

Monitoring Envoy and Envoy Gateway with Prometheus and Grafana

Envoy is a popular open source edge and service proxy that's widely used in modern cloud-native architectures. Envoy Gateway is a controller that manages Envoy proxies in a Kubernetes environment. Monitoring Envoy and Envoy Gateway is crucial for ensuring the reliability and performance of your applications. In this blog post, we'll explore how to monitor Envoy and Envoy Gateway using Prometheus and Grafana, and we'll also introduce a new monitoring-mixin for Envoy.

With the retirement of ingress-nginx, many users are looking for alternatives for ingress controllers. Envoy Gateway is a great option for those who want to leverage the power of Envoy in their Kubernetes clusters. I recently migrated from ingress-nginx and you can read more about it here.

Replacing Ingress-NGINX with Envoy Gateway in My Personal Cluster

With the retirement of ingress-nginx, many users are looking for alternatives for ingress controllers. Envoy Gateway looked like a promising option, so I decided to give it a try in my personal Kubernetes cluster. I'll describe my experience deploying Envoy Gateway and how I replicated my previous ingress-nginx setup. This blog post covers my personal cluster; the migration would have been harder in a production environment with more complex requirements.