Skip to content

Implement SERIAL_LATEST_ONLY run strategy#211

Merged
akashdw merged 3 commits into
mainfrom
adwivedi/slo
Jun 2, 2026
Merged

Implement SERIAL_LATEST_ONLY run strategy#211
akashdw merged 3 commits into
mainfrom
adwivedi/slo

Conversation

@akashdw

@akashdw akashdw commented May 28, 2026

Copy link
Copy Markdown
Collaborator

Pull Request type

  • Bugfix
  • [ x] Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes (Please run ./gradlew build --write-locks to refresh dependencies)
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

SERIAL_LATEST_ONLY Run Strategy

Adds a new run strategy that fills the gap between SEQUENTIAL and LAST_ONLY. It provides serial execution (one instance at a time) with eager stale-queue collapse, i.e when a new instance arrives, all existing queued instances are stopped and the new one becomes the sole queued candidate. The running instance is never terminated by a new arrival.

This is useful for workflows where only the most recent trigger state matters and replaying a stale backlog is wasteful, for example, periodic refresh, snapshot publish, or materialized-view rebuild jobs. The strategy is analogous to Airflow's schedule + catchup=False + max_active_runs=1 combination.

Switching into SERIAL_LATEST_ONLY follows the same free-switch behavior as SEQUENTIAL, i.e no gate on the number of non-terminal instances. In-flight instances drain naturally after a strategy change, and the SLO contract takes effect on subsequent arrivals.

Tested locally. A few behaviors worth noting:

  • Switching from PARALLEL to SERIAL_LATEST_ONLY follows the same semantics as switching from PARALLEL to SEQUENTIAL — running instances drain naturally and existing queued instances are left untouched until a new arrival triggers the collapse.
  • Restarting an older instance produces a new run that is subject to the full SLO contract — any currently queued instance is stopped and the restart becomes the new queued candidate.

@akashdw akashdw requested a review from praneethy91 May 28, 2026 19:51
case LAST_ONLY:
return null; // no queueing support
case SERIAL_LATEST_ONLY:
return dequeueWorkflowInstances(workflowId, 1, false);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment so it is more clear what 1, false mean here?

@akashdw akashdw May 29, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its fairly straightforward, In dequeueWorkflowInstances, 1 refers to the concurrency value for SERIAL_LATEST_ONLY (only 1 instance can run), and strict=false indicates that this is not a strict* run strategy, i.e it doesn't depend on the past failure.

return insertInstance(conn, instance, true, null, messages);
}

private int[] startSerialLatestOnlyInstances(

@ykitaev ykitaev May 29, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nice to have, where appropriate, not blocking] this could benefit from some javadoc and inline comments, mainly outlining the contracts (input and output) and some details on the implementation approach taken , as well as comments throughout to describe why we're where it's reasonable to expect a different implementation -- from just reading, it's not always clear, even if consistent with other code

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

assertNull(ret);
ret = runStrategyDao.dequeueWithRunStrategy(TEST_WORKFLOW_ID, RunStrategy.create("LAST_ONLY"));
assertNull(ret);
ret =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any way to add tests cases for the complex scenarios we've discussed on the slack thread? this will also clarify the intended behavior for the future maintainer, as it's not an intuitive decision one way or the other

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added missing batch path test coverage

* reaches a terminal state, only the latest queued instance is dequeued to run while every
* older queued instance is stopped. The running instance is never terminated by a new arrival.
*/
SERIAL_LATEST_ONLY;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

while every older queued instance is stopped - this makes it seems that it is not the eager collapse algorithm. We can correct the statement to say the older queued instances stop prior to that

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the doc to make it clear

private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE =
"[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]";

private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to fix maestroWorkflowDao.java 803-811 logic to disallow switches to this run strategy where more are queued - it will be better to disallow it then having behavior which user may not want (like many queued and even with run strategy switch they queued instances are not drained on first go)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behavior is aligned with the Sequential run strategy semantics described in the PR:

Switching from PARALLEL to SERIAL_LATEST_ONLY follows the same semantics as switching from PARALLEL to SEQUENTIAL — running instances drain naturally and existing queued instances are left untouched until a new arrival triggers the collapse.

With this behavior, all existing queued instances will automatically collapse to last_only as soon as a new instance is enqueued.

I think we should keep it this way, since it closely mirrors the behavior and transition semantics of the existing Sequential run strategy.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I thought it might be safe to disallow from get go but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go with this.

case LAST_ONLY:
return null; // no queueing support
case SERIAL_LATEST_ONLY:
return dequeueWorkflowInstances(workflowId, 1, false);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass runStrategy.getWorkflowConcurrency() to be consistent with the others and not hardcode 1 here? This way if we ever dynamically change this, like for example through Actions.java (properties update path )when we want to pause using concurrency updates we can pause a SLO workflow

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense and merge it with the STRICT_SEQUENTIAL switch statement

}

@Test
public void testStartRunStrategyWithSerialLatestOnlyHonorsRestart() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since restarting an older instance stops a newer queued instance it is slightly surprising behavior - we have to get confirmation from our internal use-case if that's fine as it could genuinely lead to loss of processing latest data, let's confirm before we get this PR change in

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is one of the behaviors called out in the PR description and is consistent with how other run strategies handle restarts. For example:

  • last_only: restarting an instance can stop the currently running instance.
  • first_only: a restarted instance may be automatically terminated.
  • parallel: a restarted instance may still be queued if the concurrency limit has already been reached, regardless of whether it represents the most recent run.

I think this is reasonable because a restart is an explicit manual action, so clearly documenting the side effect should be sufficient.

Restarting an older instance produces a new run that is subject to the full SLO contract — any currently queued instance is stopped and the restart becomes the new queued candidate.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked with our internal use case, this restart behavior is acceptable.

}

@Test
public void testStartRunStrategyWithSerialLatestOnly() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

before we merge this: the dequeue tests cover the decision in isolation with mocks, but we don't have anything
exercising the actual running → terminate → re-dequeue handoff. Could we add a DAO-level test
that marks instance 1 running, enqueues 2, asserts dequeue returns nothing, then terminates
1 and asserts dequeue now returns 2? Another variant would seed a second queued row and assert the latest survives the collapse.

Mentioned in above comment but if you enqueue a backlog under SEQUENTIAL and flip to SERIAL_LATEST_ONLY with nothing running, dequeue drains oldest-first instead of collapsing until the next arrival. We can add a test here with that behavior if we intend to keep it but my vote would be to disallow it in the code and then assert that it is diallowed here

* running instance reaches a terminal state, the single queued instance is dequeued to run. The
* running instance is never terminated by a new arrival.
*/
SERIAL_LATEST_ONLY;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the two other strategies we are bridging are called SERIAL and LAST_ONLY, would it be more consistent to call this one SERIAL_LAST_ONLY instead of latest?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose latest_only instead of last_only for the same reason we use the serial prefix instead of sequential. Both sequential and last_only already have established meanings in the system, so this naming avoids confusion and keeps things consistent.

@praneethy91 praneethy91 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements, overall looks good with good test coverage.

private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE =
"[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]";

private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY =

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, I thought it might be safe to disallow from get go but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go with this.

@akashdw akashdw merged commit d57170d into main Jun 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants