fix(runner): add Redis-based abort polling for multi-replica runners #5703
Merged
When runners have replicas > 1, pods sit behind a K8s Service that load-balances requests. The HTTP-based tRPC abort call may hit the wrong pod, leaving the task running. This adds Redis polling for the abort flag (already set by Jobs) when RUNNER_CONFLICT_RESOLUTION_MODE is REDIS, matching the existing lambda-runner pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
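A minimal sketch of the polling approach described above, not the code from this PR. The `KVStore` interface, the `InMemoryKVStore` stand-in, the `abortKey` format, and the `startAbortPolling` helper are all hypothetical names introduced for illustration; the real runner uses its own KV store abstraction and key layout.

```typescript
// Hypothetical minimal KV interface (assumption, not the runner's real API).
interface KVStore {
    exists(key: string): Promise<boolean>;
}

// In-memory stand-in for Redis so the sketch runs without a server.
class InMemoryKVStore implements KVStore {
    private readonly keys = new Set<string>();
    set(key: string): void {
        this.keys.add(key);
    }
    async exists(key: string): Promise<boolean> {
        return this.keys.has(key);
    }
}

// Assumed key format, for illustration only.
const abortKey = (taskId: string): string => `task:abort:${taskId}`;

interface PollOptions {
    kv: KVStore;
    taskId: string;
    intervalMs: number;
    // e.g. derived from RUNNER_CONFLICT_RESOLUTION_MODE === 'REDIS'
    enabled: boolean;
    onAbort: () => void;
}

// Starts polling for the abort flag and returns a function that stops it.
// When polling is disabled, no timer is created and the stop function is a no-op.
function startAbortPolling(opts: PollOptions): () => void {
    if (!opts.enabled) return () => {};

    let stopped = false;
    let timer: ReturnType<typeof setInterval> | undefined;

    const stop = (): void => {
        stopped = true;
        if (timer !== undefined) clearInterval(timer);
    };

    timer = setInterval(async () => {
        try {
            const flagged = await opts.kv.exists(abortKey(opts.taskId));
            // Guard against overlapping polls calling onAbort twice.
            if (flagged && !stopped) {
                stop();
                opts.onAbort();
            }
        } catch (err) {
            console.error('Error checking abort flag', { taskId: opts.taskId, error: err });
        }
    }, opts.intervalMs);

    return stop;
}
```

Because every replica polls the shared store for the tasks it owns, the abort reaches the right pod regardless of which pod the load balancer picked for the HTTP call. The caller should invoke the returned stop function when the task finishes so the timer does not outlive it.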
rossmcewan
approved these changes
Mar 25, 2026
TBonnin
reviewed
Mar 25, 2026
            logger.error('Error checking abort flag', { taskId, error: err });
        }
    }, abortCheckIntervalMs)
    : null;
Collaborator
We are adding a setInterval and a Redis query for each task every 1000 ms. Could that be a problem resource-wise and for Redis?
Contributor
Author
From my understanding, this is the same interval that we use for lambdas, so we already have a consistent workload coming from there at this interval, and the server load seems pretty low.
So regardless of whether we start using this in the runners, all of this workload will eventually come from our lambdas (once we do the full migration).
Still, this heartbeat is only enabled for runners that have more than one replica.
I was going to check how much extra load would come from enabling this on runners with replicas > 1; I'll try to gather some data.
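As a rough back-of-envelope aid for the load question, the added Redis traffic scales linearly with the number of concurrently running tasks and inversely with the polling interval. The helper below is a hypothetical sketch; the task counts passed to it are placeholders, not measured data.

```typescript
// Estimates Redis queries per second generated by abort polling.
// Both inputs are assumptions to be replaced with measured values.
function abortPollQueriesPerSecond(concurrentTasks: number, intervalMs: number): number {
    if (intervalMs <= 0) throw new RangeError('intervalMs must be positive');
    // Each running task issues one query per interval.
    return concurrentTasks * (1000 / intervalMs);
}
```

For example, with a hypothetical 500 concurrent tasks and the 1000 ms interval, `abortPollQueriesPerSecond(500, 1000)` gives 500 queries per second.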
TBonnin
approved these changes
Mar 25, 2026
The change also centralizes the KV store instance and exposes flags to control abort polling in the runner server.
This summary was automatically generated by @propel-code-bot