Bug Description
BinaryWorkflowTagStorage.getWorkflowIdsPage() can permanently skip workflow IDs during paginated fanout when using Redis storage. The root cause is that the internal accumulation loop always passes the full limit to keySetStorage.getPage() instead of the remaining deficit (limit - workflowIds.size).
Affected Code
File: infinitic-workflow-tag/src/main/kotlin/io/infinitic/workflows/tag/storage/BinaryWorkflowTagStorage.kt
do {
val page = keySetStorage.getPage(key, limit, nextCursor) // BUG: should be (limit - workflowIds.size)
page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)
return WorkflowIdsPage(
workflowIds = workflowIds.take(limit),
nextCursor = nextCursor,
)
Root Cause
RedisKeySetStorage.getPage() may call Redis SSCAN multiple times internally. SSCAN can return duplicate members during hash table rehashing.
RedisKeySetStorage.getPage() does not deduplicate — it returns raw results (at most limit byte arrays, but some may be duplicates).
BinaryWorkflowTagStorage adds results to a LinkedHashSet<WorkflowId>, which deduplicates. If N duplicates were returned, workflowIds.size = limit - N, which is less than limit.
- The loop continues and calls
getPage(key, limit, nextCursor) again with the full limit instead of the deficit N.
- Redis returns up to
limit more items. If the remaining items in the set are fewer than limit, all remaining items are returned and nextCursor becomes null.
- The
LinkedHashSet now contains more than limit unique IDs. take(limit) truncates the excess, and nextCursor = null signals end of pagination.
- The truncated IDs are permanently lost — the caller will never request them because
nextCursor is null.
Reproduction Scenario
- Redis storage backend
- A tag associated with 7,000 workflow IDs
fanoutPageSize = 5000 (default)
- Redis SSCAN returns duplicates during rehashing
Step-by-step trace:
Call: getWorkflowIdsPage(tag, name, limit=5000, cursor=null)
Iteration 1:
keySetStorage.getPage(key, limit=5000, cursor=null)
→ Returns 5000 byte arrays, 10 are duplicates of each other
→ LinkedHashSet size = 4990 unique IDs
→ nextCursor = "cursorA"
Loop check: 4990 < 5000 AND cursorA != null → continue
Iteration 2:
keySetStorage.getPage(key, limit=5000, cursor="cursorA") ← should be limit=10
→ Redis returns all 2000 remaining items, nextCursor = null
→ LinkedHashSet size = 4990 + 2000 = 6990
Loop check: 6990 < 5000? NO → exit
Return: take(5000) = 5000 IDs, nextCursor = null
Result: 1990 workflow IDs are permanently skipped.
The caller (continueFanout) sees nextCursor=null and stops paginating.
Impact
When this bug triggers during continueFanout(), some workflow IDs will not receive the fanout message (dispatch, cancel, signal, retry, complete-timers). The bug is completely silent — no log, no error, no metric. The only observable symptom is that some workflows addressed by tag don't receive the expected command.
Suggested Fix
Change limit to limit - workflowIds.size in the getPage() call:
do {
val page = keySetStorage.getPage(key, limit - workflowIds.size, nextCursor)
page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)
This ensures the second iteration only requests the exact number of items needed to fill the page, so the cursor does not advance past items that will be discarded by take(limit).
Affected Versions
v0.18.3 (introduced with the paginated fanout feature). MySQL and InMemory backends are not affected because they do not return duplicates from getPage().
Note: This bug was identified through AI-assisted code analysis (Kiro) while reviewing the v0.18.3 release changes for impact on the tag-engine batch feature. No production incident triggered this finding — it was discovered by tracing the code paths of the new paginated fanout logic against the Redis SSCAN behavior.
Bug Description
BinaryWorkflowTagStorage.getWorkflowIdsPage()can permanently skip workflow IDs during paginated fanout when using Redis storage. The root cause is that the internal accumulation loop always passes the fulllimittokeySetStorage.getPage()instead of the remaining deficit (limit - workflowIds.size).Affected Code
File:
infinitic-workflow-tag/src/main/kotlin/io/infinitic/workflows/tag/storage/BinaryWorkflowTagStorage.ktRoot Cause
RedisKeySetStorage.getPage()may call Redis SSCAN multiple times internally. SSCAN can return duplicate members during hash table rehashing.RedisKeySetStorage.getPage()does not deduplicate — it returns raw results (at mostlimitbyte arrays, but some may be duplicates).BinaryWorkflowTagStorageadds results to aLinkedHashSet<WorkflowId>, which deduplicates. If N duplicates were returned,workflowIds.size=limit - N, which is less thanlimit.getPage(key, limit, nextCursor)again with the fulllimitinstead of the deficitN.limitmore items. If the remaining items in the set are fewer thanlimit, all remaining items are returned andnextCursorbecomesnull.LinkedHashSetnow contains more thanlimitunique IDs.take(limit)truncates the excess, andnextCursor = nullsignals end of pagination.nextCursorisnull.Reproduction Scenario
fanoutPageSize = 5000(default)Step-by-step trace:
Impact
When this bug triggers during
continueFanout(), some workflow IDs will not receive the fanout message (dispatch, cancel, signal, retry, complete-timers). The bug is completely silent — no log, no error, no metric. The only observable symptom is that some workflows addressed by tag don't receive the expected command.Suggested Fix
Change
limittolimit - workflowIds.sizein thegetPage()call:This ensures the second iteration only requests the exact number of items needed to fill the page, so the cursor does not advance past items that will be discarded by
take(limit).Affected Versions
v0.18.3 (introduced with the paginated fanout feature). MySQL and InMemory backends are not affected because they do not return duplicates from
getPage().Note: This bug was identified through AI-assisted code analysis (Kiro) while reviewing the v0.18.3 release changes for impact on the tag-engine batch feature. No production incident triggered this finding — it was discovered by tracing the code paths of the new paginated fanout logic against the Redis SSCAN behavior.