Skip to content

BinaryWorkflowTagStorage.getWorkflowIdsPage() can silently skip workflow IDs with Redis backend #295

@Ouriel

Description

@Ouriel

Bug Description

BinaryWorkflowTagStorage.getWorkflowIdsPage() can permanently skip workflow IDs during paginated fanout when using Redis storage. The root cause is that the internal accumulation loop always passes the full limit to keySetStorage.getPage() instead of the remaining deficit (limit - workflowIds.size).

Affected Code

File: infinitic-workflow-tag/src/main/kotlin/io/infinitic/workflows/tag/storage/BinaryWorkflowTagStorage.kt

do {
  val page = keySetStorage.getPage(key, limit, nextCursor)  // BUG: should be (limit - workflowIds.size)
  page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
  nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)

return WorkflowIdsPage(
    workflowIds = workflowIds.take(limit),
    nextCursor = nextCursor,
)

Root Cause

  1. RedisKeySetStorage.getPage() may call Redis SSCAN multiple times internally. SSCAN can return duplicate members during hash table rehashing.
  2. RedisKeySetStorage.getPage() does not deduplicate — it returns raw results (at most limit byte arrays, but some may be duplicates).
  3. BinaryWorkflowTagStorage adds results to a LinkedHashSet<WorkflowId>, which deduplicates. If N duplicates were returned, workflowIds.size = limit - N, which is less than limit.
  4. The loop continues and calls getPage(key, limit, nextCursor) again with the full limit instead of the deficit N.
  5. Redis returns up to limit more items. If the remaining items in the set are fewer than limit, all remaining items are returned and nextCursor becomes null.
  6. The LinkedHashSet now contains more than limit unique IDs. take(limit) truncates the excess, and nextCursor = null signals end of pagination.
  7. The truncated IDs are permanently lost — the caller will never request them because nextCursor is null.

Reproduction Scenario

  • Redis storage backend
  • A tag associated with 7,000 workflow IDs
  • fanoutPageSize = 5000 (default)
  • Redis SSCAN returns duplicates during rehashing

Step-by-step trace:

Call: getWorkflowIdsPage(tag, name, limit=5000, cursor=null)

  Iteration 1:
    keySetStorage.getPage(key, limit=5000, cursor=null)
    → Returns 5000 byte arrays, 10 are duplicates of each other
    → LinkedHashSet size = 4990 unique IDs
    → nextCursor = "cursorA"

  Loop check: 4990 < 5000 AND cursorA != null → continue

  Iteration 2:
    keySetStorage.getPage(key, limit=5000, cursor="cursorA")  ← should be limit=10
    → Redis returns all 2000 remaining items, nextCursor = null
    → LinkedHashSet size = 4990 + 2000 = 6990

  Loop check: 6990 < 5000? NO → exit

  Return: take(5000) = 5000 IDs, nextCursor = null

  Result: 1990 workflow IDs are permanently skipped.
  The caller (continueFanout) sees nextCursor=null and stops paginating.

Impact

When this bug triggers during continueFanout(), some workflow IDs will not receive the fanout message (dispatch, cancel, signal, retry, complete-timers). The bug is completely silent — no log, no error, no metric. The only observable symptom is that some workflows addressed by tag don't receive the expected command.

Suggested Fix

Change limit to limit - workflowIds.size in the getPage() call:

do {
  val page = keySetStorage.getPage(key, limit - workflowIds.size, nextCursor)
  page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
  nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)

This ensures the second iteration only requests the exact number of items needed to fill the page, so the cursor does not advance past items that will be discarded by take(limit).

Affected Versions

v0.18.3 (introduced with the paginated fanout feature). MySQL and InMemory backends are not affected because they do not return duplicates from getPage().

Note: This bug was identified through AI-assisted code analysis (Kiro) while reviewing the v0.18.3 release changes for impact on the tag-engine batch feature. No production incident triggered this finding — it was discovered by tracing the code paths of the new paginated fanout logic against the Redis SSCAN behavior.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions