Skip to content

logmq: handle duplicate log messages idempotently #947

@alexluong

Description

@alexluong

Root-cause fix for the insert failure in #946.

Problem

The log pipeline handles at-least-once redelivery idempotently across batches (store upserts; alert counting dedups by attempt ID via SADD). Duplicates landing within the same batch are unhandled:

  • pglogstore: attempts use ON CONFLICT (time, id) DO UPDATE; Postgres rejects the same key twice in one statement → ON CONFLICT DO UPDATE command cannot affect row a second time (SQLSTATE 21000) → the whole batch fails, nacks, redelivers, and recurs (logmq-consumer batch insert fails with SQLSTATE 21000 on duplicate SQS delivery, then consumer dies permanently #946 covers the escalation to a dead consumer).
  • chlogstore: no error, but redundant rows (merge overhead, transient duplicate reads until ReplacingMergeTree merges).
  • alert evaluation: runs in full once per copy instead of once per attempt.

Duplicates arise from MQ redelivery and from producer re-publish (deliverymq retrying after a post-publish error), so they can carry different MQ message IDs — dedup must key on Attempt.ID.

Proposal

a) BatchProcessor (processBatch in internal/logmq/batchprocessor.go): dedup by Attempt.ID while parsing; on duplicate, ack immediately and drop. Safe because copies are byte-identical and the kept copy stays un-acked until persisted — the at-least-once guarantee rides on it. Keeps the existing 1:1 entry↔message pairing; alert eval runs once per unique attempt per batch.

b) LogStore contract: InsertMany must tolerate intra-batch duplicates regardless of caller. pglogstore currently violates this (the 21000) — exact approach TBD: dedup entries before insert, restructure the ON CONFLICT usage, or accept the conflict if that's expressible safely. chlogstore: best-effort dedup to avoid redundant rows. memlogstore already conforms (keyed upsert).

c) Alert replay skip: SAdd already returns whether the attempt was newly counted (internal/alert/store.go discards it today); use it to skip re-evaluating replayed attempts. Design constraint: must preserve partial-failure retry — eval errors are nacked for redelivery, and re-running disable/emit on replay is intentional recovery (internal/alert/monitor.go). A naive early-exit on already-counted would silently drop alerts when eval fails after counting; likely needs a completed-marker or restructuring of when the attempt is recorded.

Tests

  • drivertest contract case (all drivers): batch with duplicate entries inserts without error, one attempt row.
  • batchprocessor: duplicate messages → one entry inserted, all copies acked, alert eval once.
  • alert: replayed attempt skips eval; partial-failure replay still retries.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions