Skip to content

No-op indexer jobs for hard-deleted audience members#5458

Merged
ershad merged 1 commit into
mainfrom
indexer-skip-hard-deleted-rows
Jun 11, 2026
Merged

No-op indexer jobs for hard-deleted audience members#5458
ershad merged 1 commit into
mainfrom
indexer-skip-hard-deleted-rows

Conversation

@ershad

@ershad ershad commented Jun 11, 2026

Copy link
Copy Markdown
Member

What

ElasticsearchIndexerWorker treats a missing database row as a successful no-op instead of raising — but only for models in a new HARD_DELETED_CLASSES allowlist (currently just AudienceMember), and only after confirming against the primary that the row is truly gone. Any other RecordNotFound — other models, or a row that exists on the primary but wasn't yet visible to a lagging replica read — re-raises and keeps the existing retry behavior unchanged.

Why

AudienceMember is the first hard-deleted model in the indexing pipeline; every other indexed model (Purchase, Link, Balance, Installment) soft-deletes, so the worker's klass.find at execution time always succeeds for them. Audience members race their own jobs in a common flow: a buyer with several purchases unsubscribes, each purchase's callback updates the member (enqueuing update jobs at +4s and +3min), and the last one destroys it. Every queued update job then raises ActiveRecord::RecordNotFound, burns through all ten retries over hours, and lands in the dead set — pure noise, since the destroy's own after_commit already enqueued the delete that removes the document, so the retried jobs can never succeed. These failures accumulate in the retry set in production today.

Two deliberate constraints shape the fix:

  • Primary re-check before swallowing. The worker's generous retry count doubles as the replica-lag buffer (the +3-minute re-enqueue exists for the same reason): a freshly written row can be invisible to a replica read, and those jobs must keep retrying until the replica catches up. stick_to_primary! + exists? distinguishes "deleted" from "not replicated yet", so lag can never masquerade as a deletion and silently drop a document.
  • Allowlist instead of blanket rescue. For soft-deleted models a missing row always indicates lag or a bug, and their behavior stays byte-identical — the new code is unreachable for them. The constant documents the invariant a future class must satisfy to join: its destroy path must reliably enqueue the document delete.

Existing failed jobs self-heal on deploy: their next Sidekiq retry hits the primary check, confirms the row is gone, and completes as a no-op — no manual retry-set cleanup.

Follow-up to #5449.

Before/After

Non-user-facing. Before: destroying an audience member with queued index/update jobs produced hours of RecordNotFound retries per job before dead-setting. After: those jobs complete as no-ops once the primary confirms the deletion, while replica-lag scenarios retain the full retry ladder.

Draft — walkthrough video and staging QA steps to be added before marking ready for review.

Test Results

spec/sidekiq/elasticsearch_indexer_worker_spec.rb   16 examples, 0 failures

Covers all three branches: hard-deleted member → no-op with no ES calls; member present on primary but invisible to the replica read → re-raises; deleted row of a non-allowlisted model → re-raises.


AI disclosure: implemented with Claude Code using the Claude Fable 5 model.


View with Codesmith Autofix with Codesmith
Need help on this PR? Tag /codesmith with what you need. Autofix is disabled.


Note

Medium Risk
Changes background indexing behavior for a hard-deleted model; primary re-check limits risk of silently dropping documents, but misclassification on the allowlist could affect search consistency.

Overview
ElasticsearchIndexerWorker now rescues ActiveRecord::RecordNotFound on index/update and, for classes in HARD_DELETED_CLASSES (currently AudienceMember only), no-ops after stick_to_primary! and an exists? check confirm the row is really gone—so stale jobs after destroy stop retrying and hitting the dead set without touching Elasticsearch.

All other models and replica-lag cases (row still on primary) re-raise and keep the existing Sidekiq retry behavior. Specs cover no-op for destroyed audience members, re-raise when the row exists on primary, and re-raise for non-allowlisted models.

Reviewed by Cursor Bugbot for commit b08df09. Bugbot is set up for automated code reviews on this repo. Configure here.

AudienceMember is the first hard-deleted model in the Elasticsearch
indexing pipeline; every other indexed model soft-deletes, so the
worker's find-at-execution always succeeds for them. When a member is
destroyed while its index/update jobs are still queued — e.g. a buyer
with several purchases unsubscribing updates the member once per
purchase and destroys it on the last — those jobs raise RecordNotFound
and burn through all ten retries even though the destroy's own delete
job already removed the document.

The worker now treats a missing row as a successful no-op, but only for
classes in a hard-deleted allowlist, and only after re-checking the
primary: the generous retry count exists to ride out replica lag, and a
freshly written row invisible to a lagging replica read must keep
retrying rather than be mistaken for a deletion.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@ershad ershad self-assigned this Jun 11, 2026
@ershad ershad marked this pull request as ready for review June 11, 2026 19:14
@ershad ershad merged commit bcfdbe8 into main Jun 11, 2026
8 of 9 checks passed
@ershad ershad deleted the indexer-skip-hard-deleted-rows branch June 11, 2026 19:14
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds no-op handling to ElasticsearchIndexerWorker for hard-deleted models, specifically AudienceMember. When find raises RecordNotFound for a job whose class is in the new HARD_DELETED_CLASSES allowlist, the worker confirms the deletion against the primary DB before silently completing — preserving full retry behaviour for replica-lag scenarios and for all soft-deleted models.

  • Adds HARD_DELETED_CLASSES constant and a method-level rescue ActiveRecord::RecordNotFound that gates no-op treatment behind both an allowlist check and a stick_to_primary! + exists? primary re-check.
  • Adds three spec cases covering: true hard-deletion (no-op), replica-lag re-raise, and non-allowlisted model re-raise.

Confidence Score: 4/5

Safe to merge; the primary-check guard correctly prevents replica-lag scenarios from being silently dropped, and all changed paths are covered by new tests.

The allowlist check and nil-guard on record_id together produce correct behaviour across all current call sites. The one gap is that the method-level rescue also wraps the update_by_query code path — perform_update_by_query can raise RecordNotFound, and an AudienceMember update_by_query job would enter the rescue handler. Today it re-raises because record_id is absent from that params shape, but this invariant is undocumented and could silently drop jobs if the params shape ever changes. No tests cover this code path through the rescue.

app/sidekiq/elasticsearch_indexer_worker.rb — the method-level rescue scope and the implicit reliance on record_id being absent for update_by_query jobs warrant a second look.

Important Files Changed

Filename Overview
app/sidekiq/elasticsearch_indexer_worker.rb Adds a method-level rescue for RecordNotFound with an allowlist guard and primary-DB re-check; logic is correct for the intended index/update paths but the rescue scope also covers update_by_query silently.
spec/sidekiq/elasticsearch_indexer_worker_spec.rb Adds three new cases: hard-deleted member no-op (index + update), replica-lag re-raise, and non-allowlisted model re-raise; all three branches are exercised correctly.

Sequence Diagram

sequenceDiagram
    participant S as Sidekiq
    participant W as ElasticsearchIndexerWorker
    participant R as Replica DB
    participant P as Primary DB
    participant ES as Elasticsearch

    S->>W: "perform("index"/"update", {class_name: "AudienceMember", record_id: X})"
    W->>R: AudienceMember.find(X)
    R-->>W: RecordNotFound

    W->>W: class_name in HARD_DELETED_CLASSES?
    alt Not in allowlist (other models)
        W-->>S: re-raise RecordNotFound (retries)
    else In allowlist (AudienceMember)
        W->>P: stick_to_primary! + AudienceMember.exists?(X)
        alt Record exists on primary (replica lag)
            P-->>W: true
            W-->>S: re-raise RecordNotFound (retries)
        else Record gone on primary (hard-deleted)
            P-->>W: false
            W-->>S: no-op (success, no ES call)
        end
    end
Loading

Reviews (1): Last reviewed commit: "No-op indexer jobs for hard-deleted audi..." | Re-trigger Greptile

Comment on lines +60 to 72
rescue ActiveRecord::RecordNotFound
# For hard-deleted models a row already gone when its index/update job runs
# is normal: the destroy's own after_commit enqueued the delete that removes
# the document, so retrying would never succeed. Everything else — including
# a freshly written row not yet visible on a lagging replica — keeps the
# retries that exist to ride the lag out, and the primary re-check stops lag
# from masquerading as a deletion.
raise unless params["class_name"].in?(HARD_DELETED_CLASSES)

ActiveRecord::Base.connection.stick_to_primary!
searched_id = params["record_id"]
raise if searched_id.nil? || params.fetch("class_name").constantize.exists?(searched_id)
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Rescue scope silently covers update_by_query path

The method-level rescue catches ActiveRecord::RecordNotFound raised anywhere in perform, including inside perform_update_by_query (which calls klass.find(params.fetch("source_record_id")) at line 91). If an AudienceMember source_record_id is hard-deleted, the rescue proceeds into the primary-check branch. It re-raises because params["record_id"] is absent (key is source_record_id), so behaviour is correct today. However, the nil-guard being the only thing preventing a silent no-op for update_by_query jobs is an undocumented invariant — a future caller that passes both source_record_id and record_id in params would silently drop the job. A brief inline note that the searched_id.nil? guard also covers the update_by_query path would make this invariant explicit.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant