Implement SERIAL_LATEST_ONLY run strategy#211
Conversation
| case LAST_ONLY: | ||
| return null; // no queueing support | ||
| case SERIAL_LATEST_ONLY: | ||
| return dequeueWorkflowInstances(workflowId, 1, false); |
There was a problem hiding this comment.
can we add a comment so it is more clear what 1, false mean here?
There was a problem hiding this comment.
Its fairly straightforward, In dequeueWorkflowInstances, 1 refers to the concurrency value for SERIAL_LATEST_ONLY (only 1 instance can run), and strict=false indicates that this is not a strict* run strategy, i.e it doesn't depend on the past failure.
| return insertInstance(conn, instance, true, null, messages); | ||
| } | ||
|
|
||
| private int[] startSerialLatestOnlyInstances( |
There was a problem hiding this comment.
[nice to have, where appropriate, not blocking] this could benefit from some javadoc and inline comments, mainly outlining the contracts (input and output) and some details on the implementation approach taken , as well as comments throughout to describe why we're where it's reasonable to expect a different implementation -- from just reading, it's not always clear, even if consistent with other code
| assertNull(ret); | ||
| ret = runStrategyDao.dequeueWithRunStrategy(TEST_WORKFLOW_ID, RunStrategy.create("LAST_ONLY")); | ||
| assertNull(ret); | ||
| ret = |
There was a problem hiding this comment.
any way to add tests cases for the complex scenarios we've discussed on the slack thread? this will also clarify the intended behavior for the future maintainer, as it's not an intuitive decision one way or the other
There was a problem hiding this comment.
added missing batch path test coverage
… batch path test coverage
| * reaches a terminal state, only the latest queued instance is dequeued to run while every | ||
| * older queued instance is stopped. The running instance is never terminated by a new arrival. | ||
| */ | ||
| SERIAL_LATEST_ONLY; |
There was a problem hiding this comment.
while every older queued instance is stopped - this makes it seems that it is not the eager collapse algorithm. We can correct the statement to say the older queued instances stop prior to that
There was a problem hiding this comment.
updated the doc to make it clear
| private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE = | ||
| "[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]"; | ||
|
|
||
| private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY = |
There was a problem hiding this comment.
We need to fix maestroWorkflowDao.java 803-811 logic to disallow switches to this run strategy where more are queued - it will be better to disallow it then having behavior which user may not want (like many queued and even with run strategy switch they queued instances are not drained on first go)
There was a problem hiding this comment.
The current behavior is aligned with the Sequential run strategy semantics described in the PR:
Switching from PARALLEL to SERIAL_LATEST_ONLY follows the same semantics as switching from PARALLEL to SEQUENTIAL — running instances drain naturally and existing queued instances are left untouched until a new arrival triggers the collapse.
With this behavior, all existing queued instances will automatically collapse to last_only as soon as a new instance is enqueued.
I think we should keep it this way, since it closely mirrors the behavior and transition semantics of the existing Sequential run strategy.
There was a problem hiding this comment.
Sounds good, I thought it might be safe to disallow from get go but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go with this.
| case LAST_ONLY: | ||
| return null; // no queueing support | ||
| case SERIAL_LATEST_ONLY: | ||
| return dequeueWorkflowInstances(workflowId, 1, false); |
There was a problem hiding this comment.
Can we pass runStrategy.getWorkflowConcurrency() to be consistent with the others and not hardcode 1 here? This way if we ever dynamically change this, like for example through Actions.java (properties update path )when we want to pause using concurrency updates we can pause a SLO workflow
There was a problem hiding this comment.
makes sense and merge it with the STRICT_SEQUENTIAL switch statement
| } | ||
|
|
||
| @Test | ||
| public void testStartRunStrategyWithSerialLatestOnlyHonorsRestart() { |
There was a problem hiding this comment.
since restarting an older instance stops a newer queued instance it is slightly surprising behavior - we have to get confirmation from our internal use-case if that's fine as it could genuinely lead to loss of processing latest data, let's confirm before we get this PR change in
There was a problem hiding this comment.
Yes, this is one of the behaviors called out in the PR description and is consistent with how other run strategies handle restarts. For example:
last_only: restarting an instance can stop the currently running instance.first_only: a restarted instance may be automatically terminated.parallel: a restarted instance may still be queued if the concurrency limit has already been reached, regardless of whether it represents the most recent run.
I think this is reasonable because a restart is an explicit manual action, so clearly documenting the side effect should be sufficient.
Restarting an older instance produces a new run that is subject to the full SLO contract — any currently queued instance is stopped and the restart becomes the new queued candidate.
There was a problem hiding this comment.
I checked with our internal use case, this restart behavior is acceptable.
| } | ||
|
|
||
| @Test | ||
| public void testStartRunStrategyWithSerialLatestOnly() { |
There was a problem hiding this comment.
before we merge this: the dequeue tests cover the decision in isolation with mocks, but we don't have anything
exercising the actual running → terminate → re-dequeue handoff. Could we add a DAO-level test
that marks instance 1 running, enqueues 2, asserts dequeue returns nothing, then terminates
1 and asserts dequeue now returns 2? Another variant would seed a second queued row and assert the latest survives the collapse.
Mentioned in above comment but if you enqueue a backlog under SEQUENTIAL and flip to SERIAL_LATEST_ONLY with nothing running, dequeue drains oldest-first instead of collapsing until the next arrival. We can add a test here with that behavior if we intend to keep it but my vote would be to disallow it in the code and then assert that it is diallowed here
| * running instance reaches a terminal state, the single queued instance is dequeued to run. The | ||
| * running instance is never terminated by a new arrival. | ||
| */ | ||
| SERIAL_LATEST_ONLY; |
There was a problem hiding this comment.
Since the two other strategies we are bridging are called SERIAL and LAST_ONLY, would it be more consistent to call this one SERIAL_LAST_ONLY instead of latest?
There was a problem hiding this comment.
I chose latest_only instead of last_only for the same reason we use the serial prefix instead of sequential. Both sequential and last_only already have established meanings in the system, so this naming avoids confusion and keeps things consistent.
praneethy91
left a comment
There was a problem hiding this comment.
Thanks for the improvements, overall looks good with good test coverage.
| private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE = | ||
| "[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]"; | ||
|
|
||
| private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY = |
There was a problem hiding this comment.
Sounds good, I thought it might be safe to disallow from get go but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go with this.
Pull Request type
./gradlew build --write-locksto refresh dependencies)NOTE: Please remember to run
./gradlew spotlessApplyto fix any format violations.Changes in this PR
SERIAL_LATEST_ONLY Run Strategy
Adds a new run strategy that fills the gap between SEQUENTIAL and LAST_ONLY. It provides serial execution (one instance at a time) with eager stale-queue collapse, i.e when a new instance arrives, all existing queued instances are stopped and the new one becomes the sole queued candidate. The running instance is never terminated by a new arrival.
This is useful for workflows where only the most recent trigger state matters and replaying a stale backlog is wasteful, for example, periodic refresh, snapshot publish, or materialized-view rebuild jobs. The strategy is analogous to Airflow's
schedule + catchup=False + max_active_runs=1combination.Switching into SERIAL_LATEST_ONLY follows the same free-switch behavior as SEQUENTIAL, i.e no gate on the number of non-terminal instances. In-flight instances drain naturally after a strategy change, and the SLO contract takes effect on subsequent arrivals.
Tested locally. A few behaviors worth noting: