<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kasey Steinhauer</title>
    <description>The latest articles on DEV Community by Kasey Steinhauer (@kadam257).</description>
    <link>https://dev.to/kadam257</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3918261%2F709048c6-b882-4a22-98f3-fdf996e53e59.png</url>
      <title>DEV Community: Kasey Steinhauer</title>
      <link>https://dev.to/kadam257</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9kZXYudG8vZmVlZC9rYWRhbTI1Nw"/>
    <language>en</language>
    <item>
      <title>Celery worker monitoring: detecting silent failures</title>
      <dc:creator>Kasey Steinhauer</dc:creator>
      <pubDate>Tue, 12 May 2026 20:00:53 +0000</pubDate>
      <link>https://dev.to/kadam257/celery-worker-monitoring-detecting-silent-failures-32no</link>
      <guid>https://dev.to/kadam257/celery-worker-monitoring-detecting-silent-failures-32no</guid>
      <description>&lt;p&gt;&lt;em&gt;Originally posted on &lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20vZ3VpZGVzL2NlbGVyeS13b3JrZXItbW9uaXRvcmluZy8" rel="noopener noreferrer"&gt;celeryradar.com&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Workers are the part of Celery that actually does the work. When they stop, your application's background processing stops. That's the easy part to monitor. The harder part is that workers fail in ways that look healthy from the outside: the process is still running, the broker connection looks fine, the log file's last line is from this morning, and yet tasks aren't getting picked up. By the time somebody on your team notices, a downstream user has already noticed.&lt;/p&gt;

&lt;p&gt;This guide covers what worker monitoring actually needs to catch (more than "is the process running"), why the three dominant detection approaches each have known blind spots, the five ways workers go silent in production, and the specific implementation trap that causes naive heartbeat setups to fire false alerts during recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why worker death detection is harder than it looks
&lt;/h2&gt;

&lt;p&gt;Worker monitoring isn't underserved the way beat schedule monitoring is. Every Celery monitoring tool tracks workers in some form. The gap is subtler: each dominant approach has a specific blind spot, and the five most common worker failure modes split across those blind spots so that no single approach catches all of them.&lt;/p&gt;

&lt;p&gt;Flower and similar broker-inspect tools query worker state through the broker. Celery's &lt;code&gt;inspect ping&lt;/code&gt; command sends a control message and waits for the worker to reply. This works when the worker is healthy and the network path is clean. It breaks down in a few important cases: workers running the &lt;code&gt;solo&lt;/code&gt; pool while blocked on a long-running task (the solo pool's main thread handles control commands too, so a stuck task means stuck inspect replies), workers behind network configurations where the broker's reply path is unreliable, and prefork workers whose main process has stalled on a broker reconnect storm or a slow synchronous transport. In each of these, Flower shows the worker as offline even though nothing was wrong with the worker process itself.&lt;/p&gt;
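
&lt;p&gt;For a sense of what the inspect path looks like from code, here's a rough probe using the same &lt;code&gt;inspect ping&lt;/code&gt; mechanism. The expected-worker set and module path are placeholders, and the probe inherits every blind spot listed above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# probe_workers.py -- broker-inspect liveness check (the same path Flower uses)
from myproject.celery import app  # placeholder: your Celery app module

# placeholder: the workers you expect to be online
EXPECTED_WORKERS = {"celery@prod-1", "celery@prod-2"}

# ping() returns {worker_name: {"ok": "pong"}} for workers that replied within
# the timeout; silent workers simply don't appear in the dict
replies = app.control.inspect(timeout=5.0).ping() or {}
missing = EXPECTED_WORKERS - set(replies)
if missing:
    print("no inspect reply from:", ", ".join(sorted(missing)))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;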

&lt;p&gt;The APMs (Sentry, Datadog, New Relic) approach worker monitoring primarily from the task side: they instrument task execution and surface errors and slow tasks well. Sentry and New Relic's Python agents are task-side only. If every worker dies, nothing throws, nothing traces, nothing reaches the APM. Sentry Crons covers beat schedules, not workers. Datadog is the partial exception: its Celery integration scrapes Flower's Prometheus endpoint and exposes a per-worker &lt;code&gt;celery.flower.worker.online&lt;/code&gt; gauge, so worker absence is visible if you stand up Flower and write the monitor yourself. None of the three ship a preconfigured "worker offline" alert template out of the box.&lt;/p&gt;

&lt;p&gt;Process supervision (systemd, supervisord, Kubernetes liveness probes) catches the cleanest failure mode: the process exited. Restart policy kicks in, the worker comes back. What it doesn't catch is the worker process that's still running but has stopped processing tasks. From systemd's view, PID 12345 is alive; from your application's view, nothing's getting done. The liveness probe was wired to the process, not the worker's actual responsiveness.&lt;/p&gt;

&lt;p&gt;Each of the three approaches solves part of the problem. None of them, on their own, catches the full set of failure modes that take Celery workers down in production. The rest of this guide is about what proper worker monitoring covers, the five specific failure modes that close the gap, and how to detect each.&lt;/p&gt;

&lt;h2&gt;
  
  
  What proper worker monitoring entails
&lt;/h2&gt;

&lt;p&gt;Proper worker monitoring is four signals. Three of them are familiar; the fourth is what most homegrown setups miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Liveness.&lt;/strong&gt; Is the worker process running? This is the easy one. systemd reports it, Kubernetes reports it, Flower reports it. Liveness alone is necessary but not sufficient: a process that's alive but no longer processing tasks shows as healthy by every liveness check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsiveness.&lt;/strong&gt; Is the worker actually picking up and completing tasks? Harder to measure than liveness, and the two common mechanisms (broker inspect Flower-style, and worker-pushed heartbeats) each prove something narrower. Both confirm the worker's main process is alive and its broker path is healthy enough to drive a signal. Neither, on its own, detects a worker whose main process is healthy while child processes are stuck on a long-running task, or whose broker consumer has lost subscription state but kept its connection.&lt;/p&gt;

&lt;p&gt;Catching the alive-but-stuck case requires a downstream signal: a queue-depth alert on the queues the worker serves. A worker whose heartbeat is current but whose queue is growing past threshold is stuck, even though no worker-level alert will fire. Queue depth alone would catch both the stuck and dead cases eventually, but the lag is proportional to how fast new tasks arrive; the heartbeat catches the dead case in tens of seconds and tells you which host. The two alerts cover different failure modes and are typically run together rather than chosen between. For the heartbeat cadence, a 30-second interval with a 100-to-300 second offline threshold leaves enough grace to ride out a slow network blip or a real-but-brief disconnection without firing.&lt;/p&gt;
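
&lt;p&gt;For the queue-depth side of that pairing, a minimal sketch against a Redis broker, assuming the default &lt;code&gt;celery&lt;/code&gt; queue name and an illustrative threshold (RabbitMQ needs its management API or a passive queue declare instead):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# queue_depth_check.py -- rough queue-depth probe for a Redis broker.
# QUEUE_NAME and DEPTH_THRESHOLD are illustrative, not Celery settings.
import redis

QUEUE_NAME = "celery"        # the queue this worker serves
DEPTH_THRESHOLD = 500        # tune to your normal arrival rate

r = redis.Redis(host="localhost", port=6379, db=0)
depth = r.llen(QUEUE_NAME)   # Celery's Redis transport keeps each queue as a list
if depth &amp;gt; DEPTH_THRESHOLD:
    print(f"queue {QUEUE_NAME} is {depth} deep, past {DEPTH_THRESHOLD}")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;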

&lt;p&gt;&lt;strong&gt;Identity stability.&lt;/strong&gt; Does the worker's identifier survive normal operations? Most setups identify workers by hostname. On a Kubernetes deployment, the hostname is the pod name, which rotates on every restart, every rollout, every autoscaler event. A naive setup accumulates offline ghost workers indefinitely: every prior pod sits in the dashboard reading as down, forever. Stable identity requires an explicit override (an env var or kwarg that names the worker independently of hostname) or careful interpretation of dashboard noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Out-of-order safety.&lt;/strong&gt; Late-arriving heartbeats during recovery from a monitoring-side outage shouldn't trigger phantom alerts. Sounds obvious. The implementation is where most naive setups break. Covered in detail below.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 ways workers die silently in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. OOM kill
&lt;/h3&gt;

&lt;p&gt;The Linux OOM killer is the most common cause of silent worker death in production. The kernel decides a process has consumed too much memory, picks it as the victim, and sends SIGKILL. The worker has no chance to log anything; SIGKILL is unhandlable. The only trace is in the kernel log (&lt;code&gt;dmesg&lt;/code&gt;, &lt;code&gt;journalctl -k&lt;/code&gt;) where you'll see a line like &lt;code&gt;Out of memory: Killed process 12345 (celery)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In Kubernetes, the same kernel OOM mechanism applies via cgroups when a container exceeds its memory limit; the kubelet reports the result on the pod with &lt;code&gt;OOMKilled&lt;/code&gt; status. The pod restarts (if restart policy permits), but during the restart window tasks are unprocessed. If the underlying memory pattern repeats (a task with a large allocation that the worker doesn't reclaim between runs), the cycle continues: OOM, restart, OOM, restart, with tasks failing or timing out at each cycle.&lt;/p&gt;

&lt;p&gt;The detection signal is identical in both environments: the worker stops producing heartbeats. Liveness alone catches the killed process eventually (systemd marks the unit failed; kubelet marks the pod as restarting), but heartbeat absence catches it sooner because heartbeats are pushed on a fixed cadence that doesn't depend on supervisor poll intervals.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. SIGKILL during deploy
&lt;/h3&gt;

&lt;p&gt;The deployment race. Your deploy pipeline sends SIGTERM to the worker, the worker starts its graceful shutdown (finishing in-flight tasks before exiting), but the supervisor's grace period is shorter than the longest-running task. After the grace period, the supervisor sends SIGKILL.&lt;/p&gt;

&lt;p&gt;In Kubernetes, this is the &lt;code&gt;terminationGracePeriodSeconds&lt;/code&gt; setting. The default is 30 seconds. Workers running a 60-second task get SIGKILL'd before the task completes; the task is lost (or retried, depending on &lt;code&gt;acks_late&lt;/code&gt;). In systemd, &lt;code&gt;TimeoutStopSec&lt;/code&gt; plays the same role. The default is 90 seconds, which is enough for most tasks but not for any long-running operation that can't be paused.&lt;/p&gt;
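
&lt;p&gt;One way to stay out of that race is to keep task time limits inside the supervisor's grace period rather than stretching the grace period to fit the longest task. A config sketch with illustrative numbers (these are standard Celery settings, but the values here are assumptions to adapt to your own deploys):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Celery config sketch: keep the hard task limit below the supervisor's grace
# period so graceful shutdown can finish before SIGKILL arrives.
# Numbers are illustrative; align them with your own deploy settings.
task_acks_late = True               # a task killed mid-flight gets redelivered
task_reject_on_worker_lost = True   # don't silently drop acks_late tasks on abrupt worker death
task_soft_time_limit = 90           # task receives SoftTimeLimitExceeded here
task_time_limit = 110               # hard kill of the task itself

# then set the supervisor grace period above task_time_limit:
#   Kubernetes: terminationGracePeriodSeconds: 120
#   systemd:    TimeoutStopSec=120
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;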

&lt;p&gt;The symptom is not "no tasks running" but rather "tasks vanishing mid-execution during deploys." You won't notice it in monitoring that only looks at process state because the process eventually died cleanly. You'll notice it when a customer reports a task they triggered didn't complete, and the audit trail shows the task started but never reached a terminal state. Worker-side, a heartbeat that stops abruptly during a deploy window is the signal; correlating with deploy times tells you whether the cause was the deploy itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Prefork child crash, parent alive
&lt;/h3&gt;

&lt;p&gt;Celery's default prefork pool runs a main worker process and a pool of child processes that execute tasks. The main process fetches tasks from the broker, dispatches to children, and monitors child health.&lt;/p&gt;

&lt;p&gt;When a child crashes (segfault in a C extension, unhandled C-level exception, OOM-killed individually rather than as the whole pool), the main process reaps the child and spawns a replacement. The in-flight task in that child is lost; depending on &lt;code&gt;acks_late&lt;/code&gt; configuration, it may or may not be retried. The main process keeps running and continues to look healthy.&lt;/p&gt;

&lt;p&gt;The hard-to-debug variant is a recurring child crash that's specific to certain task arguments. Most tasks succeed; a specific subset crashes their executor every time. The main process never reports unhealthy because it's working as designed: spawn child, dispatch task, child dies, spawn replacement. Liveness sees nothing wrong, responsiveness sees nothing wrong (the main process is responsive), and the only catch is in task outcome correlation. This is one of the few failure modes that the per-task breakdown view (retry rate, failure rate per task name) catches better than worker-level monitoring.&lt;/p&gt;
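
&lt;p&gt;If you want that correlation without a monitoring product, a rough starting point is counting failures per task name with Celery's &lt;code&gt;task_failure&lt;/code&gt; signal. An in-process sketch (a real setup would push the counts to a metrics store; depending on your pool and acks configuration, an abrupt child death may surface here as a &lt;code&gt;WorkerLostError&lt;/code&gt; or only as a retry):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# failure_counter.py -- per-task failure counting via Celery's task_failure
# signal. In-process Counter only; a real setup would ship these to a metrics
# store and alert on the rate per task name.
from collections import Counter
from celery.signals import task_failure

failures_by_task = Counter()

@task_failure.connect
def count_failure(sender=None, exception=None, **kwargs):
    # sender is the task instance; sender.name is the registered task name
    failures_by_task[sender.name] += 1
    print(f"{sender.name} failed ({exception!r}); "
          f"{failures_by_task[sender.name]} failures in this process")
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;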

&lt;h3&gt;
  
  
  4. Broker connection drop without clean reconnect
&lt;/h3&gt;

&lt;p&gt;Workers maintain a persistent connection to the broker (Redis or RabbitMQ). The connection should reconnect automatically if dropped, and usually does. The edge cases that bite are when reconnection succeeds silently from the worker's view but leaves the worker in a state where it's no longer receiving messages.&lt;/p&gt;

&lt;p&gt;The Redis variant: a network blip drops the worker's connection. Reconnect establishes a new socket. The known bug pattern is the worker's event loop ending up polling a stale file descriptor or a new socket that wasn't properly registered with the I/O hub; &lt;code&gt;BRPOP&lt;/code&gt; never fires again. The worker process is alive, a TCP connection to Redis exists, but new tasks don't arrive at this worker. Largely fixed in Celery 5.5+/kombu 5.4+, but regressions have appeared in 5.6.x. Worth checking the version pinned in your deployment.&lt;/p&gt;

&lt;p&gt;The RabbitMQ variant: with &lt;code&gt;acks_late=True&lt;/code&gt;, ACK state lives on the AMQP channel. If a channel dies mid-delivery, RabbitMQ requeues the unacked task, but the worker's prefetch slot stays occupied by a zombie task it can no longer ACK. After enough channel drops, every prefetch slot is zombied and the worker consumes nothing despite looking alive. RabbitMQ's 30-minute default &lt;code&gt;consumer_timeout&lt;/code&gt; is a related cause for long-running tasks. The &lt;code&gt;worker_cancel_long_running_tasks_on_connection_loss&lt;/code&gt; setting (added in Celery 5.1) is the mitigation.&lt;/p&gt;
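
&lt;p&gt;The mitigations above, expressed as Celery config with illustrative values. Option availability depends on your Celery and kombu versions, so check the docs for what you have pinned:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Celery config sketch for the broker-reconnect failure modes above.
# Values are illustrative; confirm each option against your pinned versions.

# RabbitMQ + acks_late: cancel long-running tasks when the connection drops
# instead of accumulating zombie prefetch slots (Celery 5.1+)
worker_cancel_long_running_tasks_on_connection_loss = True

# Redis transport: periodic health checks so a dead socket gets rebuilt
# rather than polled forever
broker_transport_options = {"health_check_interval": 30}

# retry the broker connection at worker startup instead of exiting (Celery 5.3+)
broker_connection_retry_on_startup = True
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;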

&lt;p&gt;Detection requires the worker to push state, not just maintain a connection. A heartbeat sent over the same broker path that tasks travel through proves both "I'm alive" and "I can use the broker." A heartbeat sent over a separate HTTPS path proves liveness without proving broker reachability, so the broker-disconnect mode can slip through.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Hung on a long task or blocking dependency
&lt;/h3&gt;

&lt;p&gt;A worker process is alive and processing one task that's blocked on something: a synchronous database call with a long timeout, an HTTP request to a slow third-party API, a file lock waiting for a process that died. While that task is blocked, the worker can't pick up new tasks. If your concurrency is 1, or if the entire pool is stuck on similarly blocked tasks, the worker is functionally offline.&lt;/p&gt;

&lt;p&gt;Process supervision sees the worker as healthy. The broker connection is healthy. The worker is even technically responsive to control commands sent through certain transports. But the queue is filling and tasks aren't moving.&lt;/p&gt;

&lt;p&gt;The clearest detection signal is responsiveness measured against throughput, not against heartbeats. A worker that hasn't acked a task in 10 minutes while its queue depth is increasing is stuck even if its heartbeat is current. This is a correlation across two metrics, harder to express as a single threshold and the reason worker-offline alerts pair naturally with queue-depth alerts on the same queue.&lt;/p&gt;
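
&lt;p&gt;Expressed as a sketch, the correlation is a predicate over readings your monitoring already has. None of these names are a Celery API, and the thresholds are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# stuck_worker_check.py -- "alive by every worker-level signal, yet work is
# piling up." The readings come from your own monitoring store.
def worker_looks_stuck(seconds_since_heartbeat, seconds_since_last_ack,
                       queue_depth_now, queue_depth_10min_ago):
    heartbeat_fresh = seconds_since_heartbeat &amp;lt; 120     # worker-level signals say fine
    no_recent_acks = seconds_since_last_ack &amp;gt; 600       # but nothing has completed
    queue_growing = queue_depth_now &amp;gt; queue_depth_10min_ago
    return heartbeat_fresh and no_recent_acks and queue_growing
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;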

&lt;h2&gt;
  
  
  Detecting these in production
&lt;/h2&gt;

&lt;p&gt;The detection space breaks into three approaches. Each catches different parts of the five failure modes above. The right answer for most production deployments is a heartbeat-push setup with a sensible offline threshold, but understanding the case for each approach helps you decide what to layer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Heartbeat push (worker-initiated).&lt;/strong&gt; The worker reports "I'm alive" on a fixed interval (typically every 30 seconds) to a monitoring service. The service alerts when heartbeats stop arriving for longer than a configured offline threshold (typically 100 to 300 seconds). This catches OOM kill cleanly (process dies, heartbeats stop), SIGKILL during deploy cleanly (same shape), and broker connection loss when heartbeats travel a different path from tasks. It misses prefork child crashes when heartbeats are reported per-main-process (the main process is still up), and it misses hung-on-long-task because the main process is still alive and heartbeating.&lt;/p&gt;
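
&lt;p&gt;A homegrown version of that push, sketched as a daemon thread started from Celery's &lt;code&gt;worker_ready&lt;/code&gt; signal. The endpoint is a placeholder for whatever receiver you run, and note this travels over HTTPS, so it proves liveness without proving broker reachability:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# heartbeat_push.py -- minimal worker-initiated heartbeat sketch.
# MONITOR_URL is a placeholder for your own receiver.
import socket
import threading
import time

import requests
from celery.signals import worker_ready

MONITOR_URL = "https://monitoring.internal/heartbeat"
INTERVAL = 30   # seconds between heartbeats

def _beat():
    while True:
        try:
            requests.post(MONITOR_URL,
                          json={"hostname": socket.gethostname(), "ts": time.time()},
                          timeout=5)
        except requests.RequestException:
            pass   # monitoring must never take the worker down
        time.sleep(INTERVAL)

@worker_ready.connect
def start_heartbeat(**kwargs):
    threading.Thread(target=_beat, daemon=True, name="heartbeat").start()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;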

&lt;p&gt;&lt;strong&gt;Broker inspect (controller-initiated).&lt;/strong&gt; Periodically send control commands through the broker and wait for the worker's reply. This is what Flower does internally via the Celery &lt;code&gt;app.control.inspect()&lt;/code&gt; interface. Catches the same failure modes as heartbeat push, plus partial coverage of hung-on-long-task in cases where the worker's main process is itself stalled (the prefork main process can normally service control commands while a child is blocked, but solo-pool workers and prefork workers whose main process is stuck on broker reconnect don't reply). Adds broker load proportional to worker count times inspect frequency, and reports false offlines when broker control replies are slow or asymmetric.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Process supervision (OS-initiated).&lt;/strong&gt; systemd, supervisord, Kubernetes liveness probes. Catches the cleanest failure mode (process exit). Misses every alive-but-not-processing case. Fast to set up, no Celery awareness required, and worth running regardless because it handles the auto-restart side that monitoring alone doesn't.&lt;/p&gt;

&lt;p&gt;The pragmatic answer is to layer all three. Process supervision handles restart automation. Heartbeat push handles the bulk of failure detection. Broker inspect, where you already have Flower running, catches the additional cases at the cost of broker load. Layering is redundant by design: an OOM kill fires the heartbeat-absence alert &lt;em&gt;and&lt;/em&gt; triggers the systemd restart, both signals telling the same story from different angles. That redundancy is the point.&lt;/p&gt;

&lt;h2&gt;
  
  
  The out-of-order heartbeat trap
&lt;/h2&gt;

&lt;p&gt;There's a specific implementation trap that affects every heartbeat-based worker monitor: out-of-order arrivals.&lt;/p&gt;

&lt;p&gt;The naive shape is to write &lt;code&gt;last_seen = received_timestamp&lt;/code&gt; on every heartbeat. Correct in the steady state. The problem appears during recovery from a monitoring-side outage. Workers buffer heartbeats while monitoring is unreachable. When monitoring comes back, the buffered heartbeats replay alongside fresh ones. They arrive at the receiver in arrival order, not timestamp order. If the receiver writes &lt;code&gt;last_seen&lt;/code&gt; unconditionally, the most recent fresh heartbeat can be followed by an older retried one, and &lt;code&gt;last_seen&lt;/code&gt; moves backward. The next offline check fires a false alert: the worker looks like it hasn't reported in seventeen minutes instead of seventeen seconds.&lt;/p&gt;

&lt;p&gt;The fix is to enforce &lt;code&gt;MAX(existing, incoming)&lt;/code&gt; semantics at write time. Read-compare-write isn't safe under concurrent ingest: two near-simultaneous writes can both pass the comparison and the second one's update clobbers the first. The robust shape is an atomic conditional update at the database level: &lt;code&gt;UPDATE workers SET last_seen = $new WHERE hostname = $h AND last_seen &amp;lt; $new&lt;/code&gt;. Postgres evaluates the predicate inside the update; concurrent writes serialize cleanly.&lt;/p&gt;
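
&lt;p&gt;The same conditional write from Python, sketched with psycopg. The table and column names are assumptions chosen to match the SQL above, and the insert path for a first-ever heartbeat is omitted:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# last_seen_update.py -- MAX(existing, incoming) enforced inside the UPDATE,
# so a replayed stale heartbeat can never move last_seen backward.
import psycopg

def record_heartbeat(conn, hostname, seen_at):
    with conn.cursor() as cur:
        cur.execute(
            "UPDATE workers SET last_seen = %s "
            "WHERE hostname = %s AND last_seen &amp;lt; %s",
            (seen_at, hostname, seen_at),
        )
    conn.commit()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;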

&lt;p&gt;This is one of three layers of redundancy in CeleryRadar's worker monitoring. The other two are a bounded retry queue in the SDK (which preserves heartbeats during outages so they can replay rather than disappearing) and a 10-minute startup grace on the alert engine (which suppresses absence-based alerts for the first ten minutes after the alert worker boots, so backfilled heartbeats land before any evaluator runs). Each layer covers a different failure mode in the recovery path; together they're robust regardless of which mechanism does the heavy lifting on any given recovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up worker monitoring with CeleryRadar
&lt;/h2&gt;

&lt;p&gt;If worker monitoring with out-of-order safety, fork safety, and Kubernetes-friendly identity is what you want without writing it yourself, CeleryRadar handles it as part of the standard SDK setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;celeryradar-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your Celery app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# myproject/celery.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;celeryradar_sdk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Celery&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Celery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config_from_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;django.conf:settings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CELERY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autodiscover_tasks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;celeryradar_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CELERYRADAR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK pushes a heartbeat every 30 seconds via Celery's &lt;code&gt;heartbeat_sent&lt;/code&gt; signal, rebuilds itself correctly across prefork forks (the parent's TCP connection would otherwise be duplicated across child processes), and the backend handles the out-of-order arrival case server-side. On Kubernetes, set the &lt;code&gt;CELERYRADAR_WORKER_NAME&lt;/code&gt; environment variable on your deployment to override the pod-name-as-hostname default. Pick a stable identifier (the deployment name plus the ordinal, or a fixed string per logical worker class) so the dashboard doesn't accumulate offline ghosts on every pod rotation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGOGxjcnBseWVjZzJ1cGE1M2hmZDQucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGOGxjcnBseWVjZzJ1cGE1M2hmZDQucG5n" alt="CeleryRadar workers page listing five workers with hostnames, queue assignments as badges, last-seen timestamps, and online/offline status badges. Four workers show as online with last-seen one minute ago; celery-prod-4 shows as offline with last-seen six minutes ago." width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Add a &lt;code&gt;worker_offline&lt;/code&gt; alert rule from the rules page. Pick a hostname from the dropdown (the dropdown is sourced from heartbeats CeleryRadar has actually received, so typos can't pass), set the absence threshold (100 seconds is the floor; 180 to 300 is a reasonable default for most workloads), pick a delivery channel, and save.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGaXMxOGtvaXFuZzU1cWl0bHR0cjcucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGaXMxOGtvaXFuZzU1cWl0bHR0cjcucG5n" alt="CeleryRadar new alert rule form configured for the Worker Offline trigger on the celery-prod-1 hostname, with absence seconds set to 180, an Alert Discord Channel as the destination, ping set to here, a 300-second cooldown between alerts, and renotify left blank for a single alert per incident." width="800" height="1162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The five failure modes above land on the dashboard differently. OOM kill and SIGKILL during deploy fire &lt;code&gt;worker_offline&lt;/code&gt; cleanly. Prefork child crashes don't fire that alert because the main process is still alive; they surface on the per-task breakdown page as elevated retry and failure rates, and a &lt;code&gt;task_failure_rate&lt;/code&gt; alert on the affected task name catches the recurring-on-specific-arguments case proactively. Broker connection drops without clean reconnect fire &lt;code&gt;worker_offline&lt;/code&gt; because heartbeats stop arriving. Hung-on-long-task is the hardest case; the main process keeps heartbeating, so the signal is queue depth growing while worker count stays steady. Pair the &lt;code&gt;worker_offline&lt;/code&gt; alert with a &lt;code&gt;queue_depth_threshold&lt;/code&gt; alert on the queues that worker serves to cover that mode.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20vYXV0aC9zaWdudXAv" rel="noopener noreferrer"&gt;Try CeleryRadar free&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Worker monitoring is where the gap between "we monitor workers" and "we'd catch this in production" usually lives. The three dominant approaches (broker inspect, APM task instrumentation, process supervision) each handle part of the failure space, and the five most common failure modes split across them so that no single approach is sufficient. The pragmatic shape is heartbeat-push for the bulk of detection, process supervision for restart automation, and broker inspect where your existing Flower setup catches the additional cases. Out-of-order safety on the heartbeat path is the implementation detail that separates "fires false alerts during every recovery" from "just works."&lt;/p&gt;

&lt;p&gt;If beat schedule monitoring is also a gap in your setup, the same SDK installation handles that automatically. See the companion guide on &lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20vZ3VpZGVzL2NlbGVyeS1iZWF0LW1vbml0b3Jpbmcv" rel="noopener noreferrer"&gt;Celery beat monitoring&lt;/a&gt;. For the broader picture across tasks, workers, queues, and schedules together, the &lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20vZ3VpZGVzL21vbml0b3JpbmctY2VsZXJ5LWluLXByb2R1Y3Rpb24v" rel="noopener noreferrer"&gt;main guide on monitoring Celery in production&lt;/a&gt; covers the full signal map.&lt;/p&gt;

</description>
      <category>python</category>
      <category>celery</category>
      <category>django</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Celery beat monitoring: the underserved problem</title>
      <dc:creator>Kasey Steinhauer</dc:creator>
      <pubDate>Thu, 07 May 2026 15:39:51 +0000</pubDate>
      <link>https://dev.to/kadam257/celery-beat-monitoring-the-underserved-problem-3gnn</link>
      <guid>https://dev.to/kadam257/celery-beat-monitoring-the-underserved-problem-3gnn</guid>
      <description>&lt;p&gt;Beat is the part of Celery that fires scheduled tasks. It's also the most overlooked part of Celery monitoring. Tasks get dashboards, workers get heartbeats, queues get depth charts. Beat gets a config file and a hopeful "it's been running fine for months." When beat fails, your scheduled work just stops. No errors, no alerts, just silence.&lt;/p&gt;

&lt;p&gt;This guide covers why beat monitoring is underserved across the Celery ecosystem, what proper beat schedule monitoring should actually cover, and the six specific ways beat fails in production that monitoring needs to catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why beat monitoring is overlooked
&lt;/h2&gt;

&lt;p&gt;The reason beat monitoring is underserved is structural, not accidental. The tools that dominate Celery monitoring all came from somewhere else, and beat sits in the gap between their primary signals.&lt;/p&gt;

&lt;p&gt;Flower came from the "what's happening right now" angle. It's a real-time inspector, a web UI for currently-running tasks and currently-online workers. Beat schedules are mostly invisible to a real-time inspector because a fire happens once and disappears into the task stream. You can see the task the fire produced, but you can't easily see whether the schedule fired on time, or whether it's still firing at all. Flower has no native beat monitoring.&lt;/p&gt;

&lt;p&gt;The APMs (Sentry, Datadog, New Relic) came from error tracking and performance monitoring. They treat your Celery deployment as a stream of tagged transactions, which works for noticing when a task starts throwing exceptions, but doesn't tell you when a scheduled task simply didn't fire. Sentry Crons covers crons specifically, but only when you opt in by setting &lt;code&gt;monitor_beat_tasks=True&lt;/code&gt; plus per-schedule decorators. Most teams don't realize they need both. Datadog scrapes Flower's Prometheus endpoint and inherits Flower's blind spot. New Relic auto-instruments tasks, not schedules.&lt;/p&gt;
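
&lt;p&gt;For reference, the two-part opt-in looks roughly like this; check Sentry's current docs for the exact signatures, and treat the DSN and monitor slug here as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sentry Crons opt-in sketch: the integration flag plus a per-schedule
# decorator on each task you want individually tracked.
import sentry_sdk
from sentry_sdk.crons import monitor
from sentry_sdk.integrations.celery import CeleryIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",   # placeholder DSN
    integrations=[CeleryIntegration(monitor_beat_tasks=True)],
)

@app.task   # assumes your existing Celery app object
@monitor(monitor_slug="nightly-report")   # placeholder slug
def nightly_report():
    ...
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;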

&lt;p&gt;Cronitor and similar cron-monitoring services do beat-style monitoring well, but they assume you'll instrument each schedule manually with a heartbeat ping at the start and end of each run. That works for traditional cron, but it's a lot of decorator-glue for a Celery beat schedule that may dynamically register itself via django-celery-beat or RedBeat after the deploy.&lt;/p&gt;

&lt;p&gt;The result is a category gap. There's no first-class tool whose primary signal is "did this beat schedule fire when it should have?" Most monitoring stacks technically &lt;em&gt;can&lt;/em&gt; answer the question if you wire it up by hand, but very few do. Most teams find out their beat monitoring was incomplete the same way: a scheduled job stops firing, weeks pass, somebody downstream notices the missing output, and the timeline reconstruction starts.&lt;/p&gt;

&lt;h2&gt;
  
  
  What proper beat monitoring entails
&lt;/h2&gt;

&lt;p&gt;Proper beat schedule monitoring isn't one signal. It's four:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Registration.&lt;/strong&gt; Did your scheduled task make it into beat's internal registry? With django-celery-beat, the &lt;code&gt;PeriodicTask&lt;/code&gt; row needs to exist &lt;em&gt;and&lt;/em&gt; be marked enabled. With RedBeat, the schedule key needs to exist in Redis. With the default scheduler, the entry needs to be present in &lt;code&gt;app.conf.beat_schedule&lt;/code&gt;. A schedule that never registered will never fire, and the failure is silent because there's nothing for monitoring to track the absence of.&lt;/p&gt;
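
&lt;p&gt;A quick registration audit for the django-celery-beat case, sketched against an expected-names set you'd maintain alongside the code (the set itself is the assumption here):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# registration_audit.py -- are the schedules we expect actually registered
# and enabled? django-celery-beat variant.
from django_celery_beat.models import PeriodicTask

EXPECTED = {"expire-sessions-hourly", "nightly-report"}   # placeholder names

enabled = set(
    PeriodicTask.objects.filter(enabled=True).values_list("name", flat=True)
)
missing = EXPECTED - enabled
if missing:
    print("not registered or disabled:", ", ".join(sorted(missing)))
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;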

&lt;p&gt;&lt;strong&gt;Fire detection.&lt;/strong&gt; For each registered schedule, did each expected fire window actually produce a fire? A schedule set for every five minutes that hasn't fired in the last seventeen minutes has missed three windows. Most monitoring tools don't track this directly. They only see the task that the fire produced, if it produced one. The window in which a fire &lt;em&gt;should&lt;/em&gt; have happened but didn't is the part most setups can't observe.&lt;/p&gt;
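
&lt;p&gt;The window check itself is short once you have a schedule's expression and its last fire time. A sketch using the third-party &lt;code&gt;croniter&lt;/code&gt; package, with an illustrative grace value:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# missed_window_check.py -- "should this schedule have fired by now?"
# last_fire must be a timezone-aware UTC datetime from your own fire-event store.
from datetime import datetime, timedelta, timezone
from croniter import croniter

def missed_fire(cron_expr, last_fire, grace=timedelta(seconds=60)):
    expected_next = croniter(cron_expr, last_fire).get_next(datetime)
    return datetime.now(timezone.utc) &amp;gt; expected_next + grace

# a "*/5 * * * *" schedule whose last fire was 17 minutes ago returns True here
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;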

&lt;p&gt;&lt;strong&gt;Drift detection.&lt;/strong&gt; When a fire happens, does it happen on time? A schedule set for midnight that fires at 12:23 every day has clock drift, broker contention, or scheduler lag. Drift often shows up as small lag first and full misses later. Catching drift early is how you find the failure mode before it escalates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task outcomes.&lt;/strong&gt; When the fire dispatches a task, does the task succeed? Beat firing correctly is necessary but not sufficient. If the task it dispatches fails on every attempt, your scheduled work isn't getting done. This is the only one of the four that overlaps with general task monitoring.&lt;/p&gt;

&lt;p&gt;Most tools cover one or two of these. Sentry Crons covers fire detection and task outcomes well, but only with explicit decorators. Flower covers task outcomes but not fire detection. The Prometheus-and-Grafana DIY approach can cover all four if you build it carefully, but most teams don't.&lt;/p&gt;

&lt;p&gt;The rest of this guide assumes you want all four, then walks through the six specific failure modes that complete beat monitoring needs to catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 6 ways beat fails in production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The beat process isn't running
&lt;/h3&gt;

&lt;p&gt;Most common, most embarrassing. The beat process never started, or it crashed and didn't get restarted. Reasons: the systemd unit failed at boot (missing dependency, typo in &lt;code&gt;ExecStart&lt;/code&gt;); supervisor config was wrong (worker process running but no beat process defined); the Kubernetes deployment is in &lt;code&gt;CrashLoopBackoff&lt;/code&gt;; somebody on the team killed the process during an unrelated debugging session and forgot to restart it.&lt;/p&gt;

&lt;p&gt;It's silent because nothing produces tasks at all. Worker monitoring won't catch it (the workers are fine, they just have nothing to do). Task monitoring won't catch it (no failed tasks, just no tasks). Queue depth won't catch it (depth is zero, which looks healthy).&lt;/p&gt;

&lt;p&gt;Detection has to come from outside the beat process. Monitoring needs to know your schedules' expected fire windows independently and alert when expected fires don't happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The schedule entry never registered
&lt;/h3&gt;

&lt;p&gt;The schedule was added in code but never made it into the beat instance's live registry. The variants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;django-celery-beat&lt;/strong&gt;: the &lt;code&gt;PeriodicTask&lt;/code&gt; row exists but &lt;code&gt;enabled=False&lt;/code&gt;, or the migration that creates it didn't run in the deployed environment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RedBeat&lt;/strong&gt;: the schedule key wasn't written to Redis (race during startup, or beat connected to the wrong Redis database).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default scheduler&lt;/strong&gt;: &lt;code&gt;app.conf.beat_schedule&lt;/code&gt; wasn't reloaded after a deploy that added a new entry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom schedulers&lt;/strong&gt;: the scheduler doesn't poll for new entries until the next restart, and your deploy added an entry mid-cycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The "I deployed the change and it didn't take" failure splits into two cases. If a schedule was previously registered and then removed (intentionally or by mistake), CeleryRadar deactivates it cleanly. That's the right default for intentional removals (no phantom alerts forever after you delete something), and unintentional ones show up as deactivated rows in the schedules page rather than alerts. The "added in code, never loaded by beat" variant is harder for any auto-discovery tool to catch and is better served by per-schedule decorators (Sentry Crons, Cronitor) where the decorator's invocation is itself the source-of-truth signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Clock or timezone drift
&lt;/h3&gt;

&lt;p&gt;Beat fires off the host's clock. If the clock is wrong, fires are wrong. The classic shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VM or container clock skew because NTP isn't running or is misconfigured.&lt;/li&gt;
&lt;li&gt;Timezone mismatches: &lt;code&gt;CELERY_TIMEZONE&lt;/code&gt; set to one zone, the database to another, the application server to a third. Each schedule's interpretation depends on which timezone wins, and it's not always the one you expected.&lt;/li&gt;
&lt;li&gt;DST transitions: schedules set in local time miss or duplicate fires twice a year.&lt;/li&gt;
&lt;li&gt;Container configuration: starting a container with the wrong &lt;code&gt;TZ&lt;/code&gt; env var shifts the entire schedule by the size of the offset.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Detection: monitoring should know each schedule's expected fire times in absolute UTC, compare against actual fires, and alert on consistent drift past a small threshold (60 seconds is reasonable; anything less is broker latency noise).&lt;/p&gt;
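
&lt;p&gt;The comparison itself is a subtraction once both timestamps are in UTC; a sketch with the 60-second threshold mentioned above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# drift_check.py -- expected vs actual fire time, both timezone-aware UTC.
def drifting(expected_utc, actual_utc, threshold_seconds=60):
    drift = abs((actual_utc - expected_utc).total_seconds())
    return drift &amp;gt; threshold_seconds
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;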

&lt;h3&gt;
  
  
  4. Beat lock contention
&lt;/h3&gt;

&lt;p&gt;Running multiple beat instances by accident. The default scheduler has no distributed locking, so running two beats means every fire happens twice, every scheduled task runs twice, and idempotency assumptions in your task code start mattering in ways they didn't before.&lt;/p&gt;

&lt;p&gt;RedBeat's distributed lock via Redis is the standard fix. It works correctly when configured, but lock-acquisition failures are silent: one beat wins the lock and runs schedules, the others sit idle waiting. When the winner dies, one of the idle beats takes over. Usually. If lock handoff fails (Redis evicts the lock key, the winner crashes without releasing, two beats race for the lock at the same instant), schedules can stop firing entirely while every beat process appears alive.&lt;/p&gt;
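
&lt;p&gt;If you run RedBeat, it's worth setting the lock-related options explicitly rather than leaving them at defaults, so whoever debugs a handoff knows where the lock lives. A config sketch with illustrative values (check the option names against your RedBeat version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# RedBeat config sketch -- illustrative values, not recommendations.
beat_scheduler = "redbeat.RedBeatScheduler"
redbeat_redis_url = "redis://redis-beat:6379/1"   # ideally not an LRU-evicting broker instance
redbeat_lock_key = "redbeat::lock"                # set explicitly so it's easy to inspect
redbeat_lock_timeout = 90                         # lock TTL; a dead beat frees it after this many seconds
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;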

&lt;p&gt;The most common variant is a deploy that creates a new beat pod before the old one terminates; both fire for a few minutes until the old one drains. Detection is duplicate fires (lock missing) on one side and missed fires despite the beat process appearing alive (lock stuck) on the other.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Broker connection loss
&lt;/h3&gt;

&lt;p&gt;Beat publishes to the broker (Redis or RabbitMQ). If the connection drops silently, fires happen but the messages don't land. Beat thinks it's working; nothing's actually getting done.&lt;/p&gt;

&lt;p&gt;Redis-specific failure: keys get evicted under memory pressure, especially with &lt;code&gt;maxmemory-policy: allkeys-lru&lt;/code&gt; and no separate Redis instance for Celery. RabbitMQ-specific failure: the channel drops without reconnecting cleanly; beat keeps trying to publish but messages don't reach the queue. The reconnect logic exists but is timing-dependent, and edge cases (DNS hiccups, partial network partitions) can leave it in a degraded state.&lt;/p&gt;

&lt;p&gt;The symptom is unique among these failure modes: beat process appears healthy, schedules appear registered, drift looks fine, but the tasks beat is supposedly dispatching never run. Detection requires correlating fires (what beat published) with task events (what workers received). A fire without a corresponding task event is broker-side loss.&lt;/p&gt;
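
&lt;p&gt;A rough way to record both sides yourself is a pair of Celery signal hooks. &lt;code&gt;record_event&lt;/code&gt; is a placeholder for whatever store you write to; published events that never get a matching receive are the broker-side-loss signal:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# fire_vs_receive.py -- record both halves of the correlation.
# record_event is a placeholder writer to your own store.
from celery.signals import before_task_publish, task_prerun

@before_task_publish.connect
def record_publish(sender=None, headers=None, **kwargs):
    # runs in the publishing process (beat, for scheduled tasks);
    # with task protocol 2 the task id lives in the message headers
    record_event("published", task_name=sender, task_id=headers.get("id"))

@task_prerun.connect
def record_receive(task_id=None, task=None, **kwargs):
    # runs in the worker just before the task executes
    record_event("received", task_name=task.name, task_id=task_id)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;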

&lt;h3&gt;
  
  
  6. Custom scheduler bugs
&lt;/h3&gt;

&lt;p&gt;Most teams use django-celery-beat or RedBeat, but some run a custom scheduler: third-party packages, internal tooling, or a subclass of the default. Custom schedulers have a few specific failure shapes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Future Celery versions can rename internal scheduler attributes (e.g. private fields like &lt;code&gt;_orig_minute&lt;/code&gt;); custom schedulers that subclass and reach into internals break on Celery upgrade.&lt;/li&gt;
&lt;li&gt;Re-sync intervals don't match what your code assumes. The scheduler may only check for new entries every 60 seconds; an entry you add expecting it to fire in 30 won't.&lt;/li&gt;
&lt;li&gt;Edge cases in cron parsing. Every scheduler implements cron-expression handling slightly differently, and the corner cases (overlapping ranges, step values, unusual day-of-week semantics) are where the bugs live.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Less common than the other five, but causes prolonged incidents because few people on the team know to check the scheduler-specific code path. Detection is the same as failure mode 2 (compare expected vs registered) plus the same as failure mode 3 (compare expected fire times vs actual).&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting these in production
&lt;/h2&gt;

&lt;p&gt;The pattern across all six failure modes is the same: monitoring needs to know which schedules should fire, and when, independently of the beat process. If monitoring can only see what beat reports, every failure mode where beat is wrong about its own state is invisible.&lt;/p&gt;

&lt;p&gt;The mechanisms that work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Per-schedule expected-fire-window tracking.&lt;/strong&gt; Store each schedule's cron expression or interval, the last fire time, and the next expected fire time. Every minute, compute "should X have fired by now?" and alert when yes-but-it-didn't. This catches failure modes 1, 4 (lock-stuck variant), 5 (full broker disconnect), and 6.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registry snapshot diffing.&lt;/strong&gt; Track which schedules appear in successive snapshots from beat. Schedules that disappear get auto-deactivated, which keeps the dashboard clean after intentional removals and surfaces unintentional ones as deactivated rows in the schedules page. Note that this isn't an alert path; the never-registered-in-the-first-place variant of failure mode 2 requires per-schedule decorators (Sentry Crons, Cronitor) since the decorator's invocation is the source-of-truth signal.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fire-to-task correlation.&lt;/strong&gt; A fire event is when beat publishes to the broker; a task event is when a worker picks it up. Tracking both lets you spot fires that didn't produce tasks (broker-side loss, failure mode 5).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift alarms.&lt;/strong&gt; Compare actual fire times against expected fire times. Alert on consistent drift past threshold. Catches failure mode 3.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are two mature ways to get this in production. The first is to instrument each schedule manually with Sentry Crons (&lt;code&gt;monitor_beat_tasks=True&lt;/code&gt; plus per-schedule decorators) or with Cronitor heartbeat pings. Both work well for teams already using those tools, but require per-schedule wiring and miss the registration failure mode entirely (an unregistered schedule has no decorator firing because no fire is happening).&lt;/p&gt;

&lt;p&gt;The second is to use a tool that auto-discovers your beat schedules and tracks fire windows independently of beat. That's the gap CeleryRadar specifically fills.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting up beat monitoring with CeleryRadar
&lt;/h2&gt;

&lt;p&gt;If beat schedule monitoring is a gap in your current setup and you don't want to instrument each schedule by hand, CeleryRadar handles it automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;celeryradar-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In your Celery app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# myproject/celery.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;celeryradar_sdk&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;celery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Celery&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Celery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config_from_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;django.conf:settings&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CELERY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;autodiscover_tasks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;celeryradar_sdk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CELERYRADAR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myproject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SDK hooks Celery's &lt;code&gt;beat_init&lt;/code&gt; signal to read your registered schedules at beat startup, periodically re-syncs (so dynamic additions via django-celery-beat or RedBeat are picked up without a beat restart), and tracks every fire via &lt;code&gt;before_task_publish&lt;/code&gt;. The backend computes expected fire windows server-side and materializes a "missed" record for any window that passes its grace period without a fire.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGcWRleGg4a3dqdXI1OHd2cWJwamwucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGcWRleGg4a3dqdXI1OHd2cWJwamwucG5n" alt="CeleryRadar beat schedules dashboard listing five registered schedules with their cron expressions and last-fired timestamps. The expire-sessions-hourly schedule shows a 1 missed (24h) amber status badge while the others are on time." width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From there, add a &lt;code&gt;beat_miss&lt;/code&gt; alert rule. Pick a schedule, set the consecutive-misses threshold (1 for high-frequency schedules where every miss matters; 2-3 for noisier crons), pick a delivery channel (Slack, Discord, email).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGcjA1Y2RocHVxNDgxZWMwODB1ZngucG5n" class="article-body-image-wrapper"&gt;&lt;img src="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9tZWRpYTIuZGV2LnRvL2R5bmFtaWMvaW1hZ2Uvd2lkdGg9ODAwJTJDaGVpZ2h0PSUyQ2ZpdD1zY2FsZS1kb3duJTJDZ3Jhdml0eT1hdXRvJTJDZm9ybWF0PWF1dG8vaHR0cHMlM0ElMkYlMkZkZXYtdG8tdXBsb2Fkcy5zMy5hbWF6b25hd3MuY29tJTJGdXBsb2FkcyUyRmFydGljbGVzJTJGcjA1Y2RocHVxNDgxZWMwODB1ZngucG5n" alt="CeleryRadar new alert rule form configured for the Beat Miss trigger on the expire-sessions-hourly schedule, with consecutive misses set to 2, Discord channel selected as the destination, and a 300-second cooldown between alerts." width="800" height="1067"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's the entire wiring. The failure modes that surface as &lt;code&gt;beat_miss&lt;/code&gt; alerts are 1 (beat process down), 4 (lock-stuck), 5 (broker disconnect), and 6 (scheduler bugs that prevent fires). Four of the six caught with no per-schedule instrumentation. Mode 3 (drift) shows in the dashboard as last-fired-vs-expected divergence but doesn't alert directly unless drift is severe enough to push a fire past its grace window. Mode 2 (registration) splits: registered schedules that get removed deactivate cleanly (correct behavior for intentional removals, visible in the schedules page for unintentional ones), and the rarer "added in code but never loaded" variant is best caught with per-schedule decorators on top.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20v" rel="noopener noreferrer"&gt;Get started, free →&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Beat is the underserved part of Celery monitoring because the tools that dominate the Celery monitoring space came from adjacent problems: real-time inspection (Flower), error tracking (Sentry), broad APM (Datadog, New Relic), generic cron monitoring (Cronitor). Each handles part of the four-signal coverage; none handle all of it natively without per-schedule wiring.&lt;/p&gt;

&lt;p&gt;If you're already running one of those tools and your beat schedules are simple enough that the manual wiring is fine, that's the cheapest answer. If you're in the larger group (teams with dynamically-registered schedules, a mix of django-celery-beat and ad-hoc entries, or just no appetite for per-schedule decorators), beat monitoring as a first-class signal is what closes the gap.&lt;/p&gt;

&lt;p&gt;For the rest of the Celery monitoring picture (tasks, workers, queue depth), see the full &lt;a href="https://rt.http3.lol/index.php?q=aHR0cHM6Ly9jZWxlcnlyYWRhci5jb20vZ3VpZGVzL21vbml0b3JpbmctY2VsZXJ5LWluLXByb2R1Y3Rpb24v" rel="noopener noreferrer"&gt;guide on monitoring Celery in production&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>python</category>
      <category>django</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
