You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The log pipeline handles at-least-once redelivery idempotently across batches (store upserts; alert counting dedups by attempt ID via SADD). Duplicates landing within the same batch are unhandled:
chlogstore: no error, but redundant rows (merge overhead, transient duplicate reads until ReplacingMergeTree merges).
alert evaluation: runs in full once per copy instead of once per attempt.
Duplicates arise from MQ redelivery and from producer re-publish (deliverymq retrying after a post-publish error), so they can carry different MQ message IDs — dedup must key on Attempt.ID.
Proposal
a) BatchProcessor (processBatch in internal/logmq/batchprocessor.go): dedup by Attempt.ID while parsing; on duplicate, ack immediately and drop. Safe because copies are byte-identical and the kept copy stays un-acked until persisted — the at-least-once guarantee rides on it. Keeps the existing 1:1 entry↔message pairing; alert eval runs once per unique attempt per batch.
b) LogStore contract: InsertMany must tolerate intra-batch duplicates regardless of caller. pglogstore currently violates this (the 21000) — exact approach TBD: dedup entries before insert, restructure the ON CONFLICT usage, or accept the conflict if that's expressible safely. chlogstore: best-effort dedup to avoid redundant rows. memlogstore already conforms (keyed upsert).
c) Alert replay skip: SAdd already returns whether the attempt was newly counted (internal/alert/store.go discards it today); use it to skip re-evaluating replayed attempts. Design constraint: must preserve partial-failure retry — eval errors are nacked for redelivery, and re-running disable/emit on replay is intentional recovery (internal/alert/monitor.go). A naive early-exit on already-counted would silently drop alerts when eval fails after counting; likely needs a completed-marker or restructuring of when the attempt is recorded.
Tests
drivertest contract case (all drivers): batch with duplicate entries inserts without error, one attempt row.
batchprocessor: duplicate messages → one entry inserted, all copies acked, alert eval once.
alert: replayed attempt skips eval; partial-failure replay still retries.
Root-cause fix for the insert failure in #946.
Problem
The log pipeline handles at-least-once redelivery idempotently across batches (store upserts; alert counting dedups by attempt ID via
SADD). Duplicates landing within the same batch are unhandled:ON CONFLICT (time, id) DO UPDATE; Postgres rejects the same key twice in one statement →ON CONFLICT DO UPDATE command cannot affect row a second time (SQLSTATE 21000)→ the whole batch fails, nacks, redelivers, and recurs (logmq-consumerbatch insert fails with SQLSTATE 21000 on duplicate SQS delivery, then consumer dies permanently #946 covers the escalation to a dead consumer).Duplicates arise from MQ redelivery and from producer re-publish (deliverymq retrying after a post-publish error), so they can carry different MQ message IDs — dedup must key on
Attempt.ID.Proposal
a) BatchProcessor (
processBatchininternal/logmq/batchprocessor.go): dedup byAttempt.IDwhile parsing; on duplicate, ack immediately and drop. Safe because copies are byte-identical and the kept copy stays un-acked until persisted — the at-least-once guarantee rides on it. Keeps the existing 1:1 entry↔message pairing; alert eval runs once per unique attempt per batch.b) LogStore contract:
InsertManymust tolerate intra-batch duplicates regardless of caller. pglogstore currently violates this (the 21000) — exact approach TBD: dedup entries before insert, restructure theON CONFLICTusage, or accept the conflict if that's expressible safely. chlogstore: best-effort dedup to avoid redundant rows. memlogstore already conforms (keyed upsert).c) Alert replay skip:
SAddalready returns whether the attempt was newly counted (internal/alert/store.godiscards it today); use it to skip re-evaluating replayed attempts. Design constraint: must preserve partial-failure retry — eval errors are nacked for redelivery, and re-running disable/emit on replay is intentional recovery (internal/alert/monitor.go). A naive early-exit on already-counted would silently drop alerts when eval fails after counting; likely needs a completed-marker or restructuring of when the attempt is recorded.Tests