No-op indexer jobs for hard-deleted audience members#5458
Conversation
AudienceMember is the first hard-deleted model in the Elasticsearch indexing pipeline; every other indexed model soft-deletes, so the worker's find-at-execution always succeeds for them. When a member is destroyed while its index/update jobs are still queued — e.g. a buyer with several purchases unsubscribing updates the member once per purchase and destroys it on the last — those jobs raise RecordNotFound and burn through all ten retries even though the destroy's own delete job already removed the document. The worker now treats a missing row as a successful no-op, but only for classes in a hard-deleted allowlist, and only after re-checking the primary: the generous retry count exists to ride out replica lag, and a freshly written row invisible to a lagging replica read must keep retrying rather than be mistaken for a deletion. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Greptile SummaryThis PR adds no-op handling to
Confidence Score: 4/5Safe to merge; the primary-check guard correctly prevents replica-lag scenarios from being silently dropped, and all changed paths are covered by new tests. The allowlist check and nil-guard on app/sidekiq/elasticsearch_indexer_worker.rb — the method-level rescue scope and the implicit reliance on Important Files Changed
Sequence DiagramsequenceDiagram
participant S as Sidekiq
participant W as ElasticsearchIndexerWorker
participant R as Replica DB
participant P as Primary DB
participant ES as Elasticsearch
S->>W: "perform("index"/"update", {class_name: "AudienceMember", record_id: X})"
W->>R: AudienceMember.find(X)
R-->>W: RecordNotFound
W->>W: class_name in HARD_DELETED_CLASSES?
alt Not in allowlist (other models)
W-->>S: re-raise RecordNotFound (retries)
else In allowlist (AudienceMember)
W->>P: stick_to_primary! + AudienceMember.exists?(X)
alt Record exists on primary (replica lag)
P-->>W: true
W-->>S: re-raise RecordNotFound (retries)
else Record gone on primary (hard-deleted)
P-->>W: false
W-->>S: no-op (success, no ES call)
end
end
Reviews (1): Last reviewed commit: "No-op indexer jobs for hard-deleted audi..." | Re-trigger Greptile |
| rescue ActiveRecord::RecordNotFound | ||
| # For hard-deleted models a row already gone when its index/update job runs | ||
| # is normal: the destroy's own after_commit enqueued the delete that removes | ||
| # the document, so retrying would never succeed. Everything else — including | ||
| # a freshly written row not yet visible on a lagging replica — keeps the | ||
| # retries that exist to ride the lag out, and the primary re-check stops lag | ||
| # from masquerading as a deletion. | ||
| raise unless params["class_name"].in?(HARD_DELETED_CLASSES) | ||
|
|
||
| ActiveRecord::Base.connection.stick_to_primary! | ||
| searched_id = params["record_id"] | ||
| raise if searched_id.nil? || params.fetch("class_name").constantize.exists?(searched_id) | ||
| end |
There was a problem hiding this comment.
Rescue scope silently covers
update_by_query path
The method-level rescue catches ActiveRecord::RecordNotFound raised anywhere in perform, including inside perform_update_by_query (which calls klass.find(params.fetch("source_record_id")) at line 91). If an AudienceMember source_record_id is hard-deleted, the rescue proceeds into the primary-check branch. It re-raises because params["record_id"] is absent (key is source_record_id), so behaviour is correct today. However, the nil-guard being the only thing preventing a silent no-op for update_by_query jobs is an undocumented invariant — a future caller that passes both source_record_id and record_id in params would silently drop the job. A brief inline note that the searched_id.nil? guard also covers the update_by_query path would make this invariant explicit.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
What
ElasticsearchIndexerWorkertreats a missing database row as a successful no-op instead of raising — but only for models in a newHARD_DELETED_CLASSESallowlist (currently justAudienceMember), and only after confirming against the primary that the row is truly gone. Any otherRecordNotFound— other models, or a row that exists on the primary but wasn't yet visible to a lagging replica read — re-raises and keeps the existing retry behavior unchanged.Why
AudienceMemberis the first hard-deleted model in the indexing pipeline; every other indexed model (Purchase, Link, Balance, Installment) soft-deletes, so the worker'sklass.findat execution time always succeeds for them. Audience members race their own jobs in a common flow: a buyer with several purchases unsubscribes, each purchase's callback updates the member (enqueuing update jobs at +4s and +3min), and the last one destroys it. Every queued update job then raisesActiveRecord::RecordNotFound, burns through all ten retries over hours, and lands in the dead set — pure noise, since the destroy's ownafter_commitalready enqueued the delete that removes the document, so the retried jobs can never succeed. These failures accumulate in the retry set in production today.Two deliberate constraints shape the fix:
stick_to_primary!+exists?distinguishes "deleted" from "not replicated yet", so lag can never masquerade as a deletion and silently drop a document.Existing failed jobs self-heal on deploy: their next Sidekiq retry hits the primary check, confirms the row is gone, and completes as a no-op — no manual retry-set cleanup.
Follow-up to #5449.
Before/After
Non-user-facing. Before: destroying an audience member with queued index/update jobs produced hours of
RecordNotFoundretries per job before dead-setting. After: those jobs complete as no-ops once the primary confirms the deletion, while replica-lag scenarios retain the full retry ladder.Draft — walkthrough video and staging QA steps to be added before marking ready for review.
Test Results
Covers all three branches: hard-deleted member → no-op with no ES calls; member present on primary but invisible to the replica read → re-raises; deleted row of a non-allowlisted model → re-raises.
AI disclosure: implemented with Claude Code using the Claude Fable 5 model.
Need help on this PR? Tag
/codesmithwith what you need. Autofix is disabled.Note
Medium Risk
Changes background indexing behavior for a hard-deleted model; primary re-check limits risk of silently dropping documents, but misclassification on the allowlist could affect search consistency.
Overview
ElasticsearchIndexerWorkernow rescuesActiveRecord::RecordNotFoundon index/update and, for classes inHARD_DELETED_CLASSES(currentlyAudienceMemberonly), no-ops afterstick_to_primary!and anexists?check confirm the row is really gone—so stale jobs after destroy stop retrying and hitting the dead set without touching Elasticsearch.All other models and replica-lag cases (row still on primary) re-raise and keep the existing Sidekiq retry behavior. Specs cover no-op for destroyed audience members, re-raise when the row exists on primary, and re-raise for non-allowlisted models.
Reviewed by Cursor Bugbot for commit b08df09. Bugbot is set up for automated code reviews on this repo. Configure here.