fix(runner): add Redis-based abort polling for multi-replica runners #5703

Merged: pfreixes merged 1 commit into master from worktree-runner-cancellation on Mar 25, 2026

Contributor

@pfreixes pfreixes commented Mar 25, 2026

When runners have replicas > 1, pods sit behind a K8s Service that load-balances requests. The HTTP-based tRPC abort call may hit the wrong pod, leaving the task running. This adds Redis polling for the abort flag (already set by Jobs) when RUNNER_CONFLICT_RESOLUTION_MODE is REDIS, matching the existing lambda-runner pattern.


It also centralizes the KV store instance and exposes flags to control abort polling in the runner server.


This summary was automatically generated by @propel-code-bot

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
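The polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the `KVStore` interface, the `abortKey` layout, `isAbortPollingEnabled`, and `startAbortPolling` are all hypothetical names, and the real KV store API and flag names are not shown on this page. Only `RUNNER_CONFLICT_RESOLUTION_MODE` and its `REDIS` value come from the description.

```typescript
// Hypothetical minimal KV interface; the real store's API is an assumption here.
interface KVStore {
    exists(key: string): Promise<boolean>;
}

interface AbortPollingOptions {
    kvStore: KVStore;
    taskId: string;
    intervalMs: number;
    enabled: boolean;
    onAbort: () => void;
    onError?: (err: unknown) => void;
}

// Hypothetical key layout for the abort flag that Jobs is said to set.
function abortKey(taskId: string): string {
    return `runner:abort:${taskId}`;
}

// Polling is only meant to run when the conflict resolution mode is REDIS.
function isAbortPollingEnabled(env: Record<string, string | undefined>): boolean {
    return env['RUNNER_CONFLICT_RESOLUTION_MODE'] === 'REDIS';
}

// Starts a per-task interval that checks the shared abort flag.
// Returns a stop function the caller should invoke when the task finishes.
function startAbortPolling(opts: AbortPollingOptions): () => void {
    if (!opts.enabled) {
        return () => {};
    }

    let stopped = false;
    let inFlight = false;
    let timer: ReturnType<typeof setInterval> | null = null;

    const stop = (): void => {
        stopped = true;
        if (timer !== null) {
            clearInterval(timer);
            timer = null;
        }
    };

    timer = setInterval(() => {
        // Skip this tick if a previous check is still waiting on the store.
        if (stopped || inFlight) {
            return;
        }
        inFlight = true;
        void (async () => {
            try {
                const aborted = await opts.kvStore.exists(abortKey(opts.taskId));
                if (aborted && !stopped) {
                    stop();
                    opts.onAbort();
                }
            } catch (err) {
                // A failed check is reported but does not stop polling.
                opts.onError?.(err);
            } finally {
                inFlight = false;
            }
        })();
    }, opts.intervalMs);

    return stop;
}
```

Because every replica reads the same flag from the shared store, it no longer matters which pod receives the HTTP abort call. In this sketch the caller is expected to invoke the returned stop function when the task completes, so the interval does not outlive the task.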
Contributor

@propel-code-bot propel-code-bot Bot left a comment

Review found no issues with the Redis-based abort polling changes.

Status: No Issues Found | Risk: Low

Review Details

📁 2 files reviewed | 💬 0 comments

Instruction Files
└── .claude/
    ├── agents/
    │   └── nango-docs-migrator.md
    └── skills

```typescript
            logger.error('Error checking abort flag', { taskId, error: err });
        }
    }, abortCheckIntervalMs)
    : null;
```
Collaborator

We are adding a setInterval and a Redis query for each task every 1000 ms. Could this be a problem in terms of runner resources or Redis load?

Contributor Author

From my understanding, this is the same interval that we use for lambdas, so we already have a consistent workload coming from there at this interval, and the server load seems pretty low.

So regardless of whether we start using this in the runners, all of this workload will eventually come from our lambdas (once we do the full migration).

Still, this heartbeat is only enabled for runners that have more than one replica.

I was going to check how much extra load would come from enabling this for runners with replicas > 1; I'll try to gather some data.
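As a rough way to reason about the extra load, each running task on a polling-enabled runner issues one read per interval, so the steady-state read rate is the number of concurrent tasks multiplied by `1000 / intervalMs`. The helper below is a hypothetical back-of-the-envelope sketch, not something from the PR, and the task counts in the usage lines are made-up inputs rather than measured data.

```typescript
// Estimates abort-flag reads per second against the shared store.
// concurrentTasks: tasks running on runners that have polling enabled.
// intervalMs: polling interval per task (1000 ms in this discussion).
function estimateAbortPollReadsPerSecond(concurrentTasks: number, intervalMs: number): number {
    if (!Number.isFinite(intervalMs) || intervalMs <= 0) {
        throw new RangeError('intervalMs must be a positive number');
    }
    if (!Number.isFinite(concurrentTasks) || concurrentTasks < 0) {
        throw new RangeError('concurrentTasks must be a non-negative number');
    }
    return concurrentTasks * (1000 / intervalMs);
}

// Illustrative inputs only: 200 concurrent tasks polling every 1000 ms.
const readsAtOneSecond = estimateAbortPollReadsPerSecond(200, 1000); // 200 reads/s
// Halving the interval doubles the rate for the same number of tasks.
const readsAtHalfSecond = estimateAbortPollReadsPerSecond(200, 500); // 400 reads/s
```

With a 1000 ms interval the read rate simply equals the number of concurrently running tasks on multi-replica runners, so real concurrency figures are what would settle the question.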

@pfreixes pfreixes added this pull request to the merge queue Mar 25, 2026
Merged via the queue into master with commit 5098c51 Mar 25, 2026
25 checks passed
@pfreixes pfreixes deleted the worktree-runner-cancellation branch March 25, 2026 14:10