BinaryWorkflowTagStorage.getWorkflowIdsPage() can silently skip workflow IDs with Redis backend

## Bug Description

`BinaryWorkflowTagStorage.getWorkflowIdsPage()` can permanently skip workflow IDs during paginated fanout when using Redis storage. The root cause is that the internal accumulation loop always passes the full `limit` to `keySetStorage.getPage()` instead of the remaining deficit (`limit - workflowIds.size`).

## Affected Code

**File:** `infinitic-workflow-tag/src/main/kotlin/io/infinitic/workflows/tag/storage/BinaryWorkflowTagStorage.kt`

```kotlin
do {
  val page = keySetStorage.getPage(key, limit, nextCursor)  // BUG: should be (limit - workflowIds.size)
  page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
  nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)

return WorkflowIdsPage(
    workflowIds = workflowIds.take(limit),
    nextCursor = nextCursor,
)
```

## Root Cause

1. `RedisKeySetStorage.getPage()` may call Redis SSCAN multiple times internally. SSCAN can return duplicate members during hash table rehashing.
2. `RedisKeySetStorage.getPage()` does not deduplicate — it returns raw results (at most `limit` byte arrays, but some may be duplicates).
3. `BinaryWorkflowTagStorage` adds results to a `LinkedHashSet<WorkflowId>`, which deduplicates. If N duplicates were returned, `workflowIds.size` = `limit - N`, which is less than `limit`.
4. The loop continues and calls `getPage(key, limit, nextCursor)` again with the **full `limit`** instead of the deficit `N`.
5. Redis returns up to `limit` more items. If the remaining items in the set are fewer than `limit`, all remaining items are returned and `nextCursor` becomes `null`.
6. The `LinkedHashSet` now contains more than `limit` unique IDs. `take(limit)` truncates the excess, and `nextCursor = null` signals end of pagination.
7. **The truncated IDs are permanently lost** — the caller will never request them because `nextCursor` is `null`.

## Reproduction Scenario

- Redis storage backend
- A tag associated with 7,000 workflow IDs
- `fanoutPageSize = 5000` (default)
- Redis SSCAN returns duplicates during rehashing

**Step-by-step trace:**

```
Call: getWorkflowIdsPage(tag, name, limit=5000, cursor=null)

  Iteration 1:
    keySetStorage.getPage(key, limit=5000, cursor=null)
    → Returns 5000 byte arrays, 10 are duplicates of each other
    → LinkedHashSet size = 4990 unique IDs
    → nextCursor = "cursorA"

  Loop check: 4990 < 5000 AND cursorA != null → continue

  Iteration 2:
    keySetStorage.getPage(key, limit=5000, cursor="cursorA")  ← should be limit=10
    → Redis returns all 2000 remaining items, nextCursor = null
    → LinkedHashSet size = 4990 + 2000 = 6990

  Loop check: 6990 < 5000? NO → exit

  Return: take(5000) = 5000 IDs, nextCursor = null

  Result: 1990 workflow IDs are permanently skipped.
  The caller (continueFanout) sees nextCursor=null and stops paginating.
```

## Impact

When this bug triggers during `continueFanout()`, some workflow IDs will not receive the fanout message (dispatch, cancel, signal, retry, complete-timers). The bug is **completely silent** — no log, no error, no metric. The only observable symptom is that some workflows addressed by tag don't receive the expected command.

## Suggested Fix

Change `limit` to `limit - workflowIds.size` in the `getPage()` call:

```kotlin
do {
  val page = keySetStorage.getPage(key, limit - workflowIds.size, nextCursor)
  page.values.forEach { workflowIds.add(WorkflowId(String(it))) }
  nextCursor = page.nextCursor
} while (workflowIds.size < limit && nextCursor != null)
```

This ensures the second iteration only requests the exact number of items needed to fill the page, so the cursor does not advance past items that will be discarded by `take(limit)`.

## Affected Versions

v0.18.3 (introduced with the paginated fanout feature). MySQL and InMemory backends are not affected because they do not return duplicates from `getPage()`.

Note: This bug was identified through AI-assisted code analysis (Kiro) while reviewing the v0.18.3 release changes for impact on the tag-engine batch feature. No production incident triggered this finding — it was discovered by tracing the code paths of the new paginated fanout logic against the Redis SSCAN behavior.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BinaryWorkflowTagStorage.getWorkflowIdsPage() can silently skip workflow IDs with Redis backend #295

Bug Description

Affected Code

Root Cause

Reproduction Scenario

Impact

Suggested Fix

Affected Versions

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

BinaryWorkflowTagStorage.getWorkflowIdsPage() can silently skip workflow IDs with Redis backend #295

Description

Bug Description

Affected Code

Root Cause

Reproduction Scenario

Impact

Suggested Fix

Affected Versions

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions