Implement SERIAL_LATEST_ONLY run strategy by akashdw · Pull Request #211 · Netflix/maestro

akashdw · 2026-05-28T19:51:02Z

Pull Request type

[ ] Bugfix
[x] Feature
[ ] Refactoring (no functional changes, no api changes)
[ ] Build related changes (Please run ./gradlew build --write-locks to refresh dependencies)
[ ] Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

SERIAL_LATEST_ONLY Run Strategy

Adds a new run strategy that fills the gap between SEQUENTIAL and LAST_ONLY. It provides serial execution (one instance at a time) with eager stale-queue collapse: when a new instance arrives, all existing queued instances are stopped and the new one becomes the sole queued candidate. The running instance is never terminated by a new arrival.

This is useful for workflows where only the most recent trigger state matters and replaying a stale backlog is wasteful, for example, periodic refresh, snapshot publish, or materialized-view rebuild jobs. The strategy is analogous to Airflow's schedule + catchup=False + max_active_runs=1 combination.

Switching into SERIAL_LATEST_ONLY follows the same free-switch behavior as SEQUENTIAL, i.e., there is no gate on the number of non-terminal instances. In-flight instances drain naturally after a strategy change, and the SERIAL_LATEST_ONLY (SLO) contract takes effect on subsequent arrivals.

Tested locally. A few behaviors worth noting:

Switching from PARALLEL to SERIAL_LATEST_ONLY follows the same semantics as switching from PARALLEL to SEQUENTIAL — running instances drain naturally and existing queued instances are left untouched until a new arrival triggers the collapse.
Restarting an older instance produces a new run that is subject to the full SLO contract — any currently queued instance is stopped and the restart becomes the new queued candidate.
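
For readers skimming the thread, the contract described above can be sketched as a toy in-memory model. This is not Maestro code: the class `SerialLatestOnlySketch` and its methods are hypothetical names used only for illustration, and the model starts a queued instance immediately when nothing is running, which glosses over the real enqueue/dequeue split.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the SERIAL_LATEST_ONLY contract (hypothetical, not Maestro code). */
final class SerialLatestOnlySketch {
  private Long running; // at most one running instance
  private Long queued; // at most one queued candidate
  private final List<Long> stopped = new ArrayList<>();

  /** A new arrival (or a restart, which produces a new run) replaces the queued candidate. */
  void arrive(long instanceId) {
    if (queued != null) {
      stopped.add(queued); // eager collapse: the stale candidate is stopped at arrival time
    }
    queued = instanceId;
    tryStart();
  }

  /** The running instance reached a terminal state; promote the queued candidate, if any. */
  void finishRunning() {
    running = null;
    tryStart();
  }

  // Simplification: start right away when the single run slot is free.
  private void tryStart() {
    if (running == null && queued != null) {
      running = queued;
      queued = null;
    }
  }

  Long running() {
    return running;
  }

  Long queued() {
    return queued;
  }

  List<Long> stopped() {
    return stopped;
  }

  public static void main(String[] args) {
    SerialLatestOnlySketch s = new SerialLatestOnlySketch();
    s.arrive(1); // nothing running, so 1 starts
    s.arrive(2); // 1 keeps running, 2 is queued
    s.arrive(3); // 2 is stopped, 3 becomes the sole queued candidate
    System.out.println(s.running() + " " + s.queued() + " " + s.stopped()); // 1 3 [2]
    s.finishRunning(); // 1 terminates, 3 is promoted
    System.out.println(s.running() + " " + s.queued() + " " + s.stopped()); // 3 null [2]
  }
}
```

Note that the arrival of instance 3 never touches the running instance 1; only the queued candidate is replaced.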

anjujha · 2026-05-29T01:10:14Z

            case LAST_ONLY:
              return null; // no queueing support
+            case SERIAL_LATEST_ONLY:
+              return dequeueWorkflowInstances(workflowId, 1, false);


can we add a comment so it is more clear what 1, false mean here?

It's fairly straightforward. In dequeueWorkflowInstances, 1 is the concurrency value for SERIAL_LATEST_ONLY (only one instance can run at a time), and strict=false indicates that this is not a strict* run strategy, i.e., it doesn't depend on past failures.
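
One way to make the call self-explanatory, sketched under assumptions: the constant names below are hypothetical, and `dequeueWorkflowInstances` here is a stub that only echoes its arguments so the example runs. The real method's signature and return type are not shown in this thread.

```java
/** Hypothetical sketch of documenting the two arguments; not the PR's actual code. */
final class DequeueArgsSketch {
  /** SERIAL_LATEST_ONLY runs one instance at a time. */
  static final int SERIAL_LATEST_ONLY_CONCURRENCY = 1;

  /** Not a strict* run strategy, so dequeue does not depend on past failures. */
  static final boolean NON_STRICT = false;

  /** Stub standing in for the DAO method; it only echoes its arguments. */
  static String dequeueWorkflowInstances(String workflowId, int concurrency, boolean strict) {
    return workflowId + " concurrency=" + concurrency + " strict=" + strict;
  }

  static String dequeueSerialLatestOnly(String workflowId) {
    return dequeueWorkflowInstances(workflowId, SERIAL_LATEST_ONLY_CONCURRENCY, NON_STRICT);
  }

  public static void main(String[] args) {
    System.out.println(dequeueSerialLatestOnly("sample-wf"));
  }
}
```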

ykitaev · 2026-05-29T04:29:53Z

+    return insertInstance(conn, instance, true, null, messages);
+  }
+
+  private int[] startSerialLatestOnlyInstances(


[nice to have, where appropriate, not blocking] This could benefit from some Javadoc and inline comments, mainly outlining the contracts (input and output) and some details on the implementation approach taken, as well as comments throughout explaining the choices made where it's reasonable to expect a different implementation. From just reading the code, it's not always clear, even if it is consistent with other code.

ykitaev · 2026-05-29T04:37:25Z

    assertNull(ret);
    ret = runStrategyDao.dequeueWithRunStrategy(TEST_WORKFLOW_ID, RunStrategy.create("LAST_ONLY"));
    assertNull(ret);
+    ret =


Any way to add test cases for the complex scenarios we've discussed on the Slack thread? This will also clarify the intended behavior for future maintainers, as it's not an intuitive decision one way or the other.

added missing batch path test coverage

praneethy91 · 2026-05-29T23:21:08Z

+     * reaches a terminal state, only the latest queued instance is dequeued to run while every
+     * older queued instance is stopped. The running instance is never terminated by a new arrival.
+     */
+    SERIAL_LATEST_ONLY;


"while every older queued instance is stopped": this makes it seem that this is not the eager-collapse algorithm. We can correct the statement to say that the older queued instances are stopped prior to that.

updated the doc to make it clear

praneethy91 · 2026-05-30T00:23:27Z

+  private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE =
+      "[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]";
+
+  private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY =


We need to fix the logic in maestroWorkflowDao.java 803-811 to disallow switching to this run strategy when more instances are queued. It would be better to disallow it than to have behavior the user may not want (for example, many instances are queued and, even after the run strategy switch, the queued instances are not drained on the first go).

The current behavior is aligned with the Sequential run strategy semantics described in the PR:

Switching from PARALLEL to SERIAL_LATEST_ONLY follows the same semantics as switching from PARALLEL to SEQUENTIAL — running instances drain naturally and existing queued instances are left untouched until a new arrival triggers the collapse.

With this behavior, all existing queued instances will automatically collapse to last_only as soon as a new instance is enqueued.

I think we should keep it this way, since it closely mirrors the behavior and transition semantics of the existing Sequential run strategy.

Sounds good. I thought it might be safer to disallow it from the get-go, but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go as is.

praneethy91 · 2026-05-30T00:29:40Z

            case LAST_ONLY:
              return null; // no queueing support
+            case SERIAL_LATEST_ONLY:
+              return dequeueWorkflowInstances(workflowId, 1, false);


Can we pass runStrategy.getWorkflowConcurrency() to be consistent with the others rather than hardcoding 1 here? That way, if we ever change this dynamically, for example through Actions.java (the properties update path) when we want to pause using concurrency updates, we can pause an SLO workflow.

makes sense and merge it with the STRICT_SEQUENTIAL switch statement
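
A minimal sketch of what the suggestion amounts to, under stated assumptions: `Rule`, `RunStrategy`, and the stubbed `dequeueWorkflowInstances` below are simplified stand-ins rather than the Maestro classes, and the exact case grouping in the merged code is not visible in this thread. The point is only that the concurrency is read from the run strategy instead of being hardcoded, so a later concurrency update reaches the dequeue call.

```java
/** Hypothetical sketch of the merged switch case; not the PR's actual code. */
final class DequeueSwitchSketch {
  enum Rule { SEQUENTIAL, STRICT_SEQUENTIAL, SERIAL_LATEST_ONLY, FIRST_ONLY, LAST_ONLY, PARALLEL }

  /** Simplified stand-in for the run strategy object. */
  static final class RunStrategy {
    private final Rule rule;
    private final long workflowConcurrency;

    RunStrategy(Rule rule, long workflowConcurrency) {
      this.rule = rule;
      this.workflowConcurrency = workflowConcurrency;
    }

    Rule getRule() {
      return rule;
    }

    long getWorkflowConcurrency() {
      return workflowConcurrency;
    }
  }

  /** Stub standing in for the DAO method; it only echoes its arguments. */
  static String dequeueWorkflowInstances(String workflowId, long concurrency, boolean strict) {
    return workflowId + " concurrency=" + concurrency + " strict=" + strict;
  }

  static String dequeueWithRunStrategy(String workflowId, RunStrategy rs) {
    switch (rs.getRule()) {
      case STRICT_SEQUENTIAL:
      case SERIAL_LATEST_ONLY:
        // Assumed grouping: one shared branch, strict only for STRICT_SEQUENTIAL.
        return dequeueWorkflowInstances(
            workflowId, rs.getWorkflowConcurrency(), rs.getRule() == Rule.STRICT_SEQUENTIAL);
      case LAST_ONLY:
        return null; // no queueing support
      default:
        throw new UnsupportedOperationException("not modeled in this sketch: " + rs.getRule());
    }
  }

  public static void main(String[] args) {
    System.out.println(dequeueWithRunStrategy("wf", new RunStrategy(Rule.SERIAL_LATEST_ONLY, 1)));
    // A lowered concurrency is passed through instead of being masked by a hardcoded 1.
    System.out.println(dequeueWithRunStrategy("wf", new RunStrategy(Rule.SERIAL_LATEST_ONLY, 0)));
  }
}
```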

praneethy91 · 2026-05-30T00:35:55Z

+  }
+
+  @Test
+  public void testStartRunStrategyWithSerialLatestOnlyHonorsRestart() {


Since restarting an older instance stops a newer queued instance, this is slightly surprising behavior. We need confirmation from our internal use case that it's fine, as it could genuinely lead to losing the processing of the latest data. Let's confirm before we get this PR change in.

Yes, this is one of the behaviors called out in the PR description and is consistent with how other run strategies handle restarts. For example:

last_only: restarting an instance can stop the currently running instance.

first_only: a restarted instance may be automatically terminated.

parallel: a restarted instance may still be queued if the concurrency limit has already been reached, regardless of whether it represents the most recent run.

I think this is reasonable because a restart is an explicit manual action, so clearly documenting the side effect should be sufficient.

Restarting an older instance produces a new run that is subject to the full SLO contract — any currently queued instance is stopped and the restart becomes the new queued candidate.

I checked with our internal use case, this restart behavior is acceptable.

praneethy91 · 2026-05-30T00:45:30Z

+  }
+
+  @Test
+  public void testStartRunStrategyWithSerialLatestOnly() {


before we merge this: the dequeue tests cover the decision in isolation with mocks, but we don't have anything
exercising the actual running → terminate → re-dequeue handoff. Could we add a DAO-level test
that marks instance 1 running, enqueues 2, asserts dequeue returns nothing, then terminates
1 and asserts dequeue now returns 2? Another variant would seed a second queued row and assert the latest survives the collapse.

As mentioned in the comment above, if you enqueue a backlog under SEQUENTIAL and flip to SERIAL_LATEST_ONLY with nothing running, dequeue drains oldest-first instead of collapsing until the next arrival. We can add a test here for that behavior if we intend to keep it, but my vote would be to disallow it in the code and then assert here that it is disallowed.
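
The two scenarios in this comment can be walked through against a toy in-memory model. This is only a sketch: `HandoffSketch` and its methods are hypothetical, it assumes a concurrency of 1, and it is not a substitute for the DAO-level test requested above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Toy stand-in for the run strategy DAO (hypothetical, concurrency fixed at 1). */
final class HandoffSketch {
  private final Deque<Long> queued = new ArrayDeque<>(); // oldest first
  private final List<Long> stopped = new ArrayList<>();
  private Long running;

  /** SEQUENTIAL-style enqueue: the backlog is kept as is. */
  void enqueueSequential(long id) {
    queued.addLast(id);
  }

  /** SERIAL_LATEST_ONLY-style enqueue: stop every queued instance, then queue the arrival. */
  void enqueueSerialLatestOnly(long id) {
    stopped.addAll(queued);
    queued.clear();
    queued.addLast(id);
  }

  /** Returns the instance that starts running, or null if the slot is busy or nothing is queued. */
  Long dequeue() {
    if (running != null || queued.isEmpty()) {
      return null;
    }
    running = queued.pollFirst();
    return running;
  }

  void terminateRunning() {
    running = null;
  }

  List<Long> stopped() {
    return stopped;
  }

  public static void main(String[] args) {
    // Scenario 1: running -> terminate -> re-dequeue handoff.
    HandoffSketch dao = new HandoffSketch();
    dao.enqueueSerialLatestOnly(1);
    System.out.println(dao.dequeue()); // instance 1 starts running
    dao.enqueueSerialLatestOnly(2);
    System.out.println(dao.dequeue()); // null: instance 1 is still running
    dao.terminateRunning();
    System.out.println(dao.dequeue()); // instance 2 is handed the slot

    // Scenario 2: backlog built under SEQUENTIAL, then the strategy is flipped.
    HandoffSketch backlog = new HandoffSketch();
    backlog.enqueueSequential(1);
    backlog.enqueueSequential(2);
    backlog.enqueueSequential(3);
    System.out.println(backlog.dequeue()); // oldest first: no collapse has happened yet
    backlog.enqueueSerialLatestOnly(4); // first arrival after the flip triggers the collapse
    System.out.println(backlog.stopped()); // instances 2 and 3 were stopped
  }
}
```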

derek-miller · 2026-05-31T14:02:32Z

+     * running instance reaches a terminal state, the single queued instance is dequeued to run. The
+     * running instance is never terminated by a new arrival.
+     */
+    SERIAL_LATEST_ONLY;


Since the two other strategies we are bridging are called SERIAL and LAST_ONLY, would it be more consistent to call this one SERIAL_LAST_ONLY instead of latest?

I chose latest_only instead of last_only for the same reason we use the serial prefix instead of sequential. Both sequential and last_only already have established meanings in the system, so this naming avoids confusion and keeps things consistent.

praneethy91

Thanks for the improvements, overall looks good with good test coverage.

praneethy91 · 2026-06-01T21:16:07Z

+  private static final String SERIAL_LATEST_ONLY_TIMELINE_TEMPLATE =
+      "[\"With SERIAL_LATEST_ONLY run strategy, this run is stopped because new instance %s with run %s arrived.\"]";
+
+  private static final String STOP_SERIAL_LATEST_ONLY_QUEUED_INSTANCE_QUERY =


Sounds good. I thought it might be safer to disallow it from the get-go, but this might not be a common pattern and can be fixed easily later if needed, so it should be good to go as is.

Implement SERIAL_LATEST_ONLY run strategy

c40a3f1

akashdw requested a review from praneethy91 May 28, 2026 19:51

anjujha reviewed May 29, 2026

ykitaev reviewed May 29, 2026

Simplify startSerialLatestOnlyInstances, add Javadoc, and add missing batch path test coverage

bdb3373

ykitaev approved these changes May 29, 2026

praneethy91 reviewed May 30, 2026

praneethy91 requested changes May 30, 2026

praneethy91 reviewed May 30, 2026

address code review comments

fabf8c5

derek-miller approved these changes May 31, 2026

praneethy91 approved these changes Jun 2, 2026

VD44 approved these changes Jun 2, 2026

akashdw merged commit d57170d into main Jun 2, 2026
1 check passed
