Skip to content

Conversation

@benhoyt
Copy link
Contributor

@benhoyt benhoyt commented Oct 20, 2025

Exec tasks are not retryable (by design). However, if the Pebble daemon is hard-killed when an exec task is running, then when the daemon is restarted, the state engine tries to run that task (which was in Doing status) again, and returns an error from the task handler.

Simply log this at debug level instead of returning a task error, as it's not an error.

Before:

2025-10-20T22:29:41.286Z [pebble] Started daemon.
2025-10-20T22:29:41.287Z [pebble] Change 1 task (Execute command "sleep") failed: internal error: cannot get exec setup object for task "1"
2025-10-20T22:29:41.287Z [pebble] POST /v1/services 982.809µs 400 (http+unix)
2025-10-20T22:29:41.287Z [pebble] Cannot start default services: no default services

After:

2025-10-20T22:33:28.862Z [pebble] Started daemon.
2025-10-20T22:33:28.865Z [pebble] POST /v1/services 2.696031ms 400 (http+unix)
2025-10-20T22:33:28.866Z [pebble] Cannot start default services: no default services

To repro or test:

  • Start pebble run in one terminal
  • Run a long-running exec like pebble exec sleep 1000 in another terminal
  • kill -9 <pebble_run_pid>
  • Restart pebble run and (before the fix) you'll see the error log on startup

Fixes #587

Exec tasks are not retryable (by design). However, if the Pebble daemon
is hard-killed when an exec task is running, then when the daemon is
restarted, the state engine tries to run that task (which was in Doing
status) again, and returns an error from the task handler.

Simply log this at debug level instead of returning a task error, as
it's not an error.

Before:
2025-10-20T22:29:41.286Z [pebble] Started daemon.
2025-10-20T22:29:41.287Z [pebble] Change 1 task (Execute command "sleep") failed: internal error: cannot get exec setup object for task "1"
2025-10-20T22:29:41.287Z [pebble] POST /v1/services 982.809µs 400 (http+unix)
2025-10-20T22:29:41.287Z [pebble] Cannot start default services: no default services

After:
2025-10-20T22:33:28.862Z [pebble] Started daemon.
2025-10-20T22:33:28.865Z [pebble] POST /v1/services 2.696031ms 400 (http+unix)
2025-10-20T22:33:28.866Z [pebble] Cannot start default services: no default services

Fixes canonical#587
Copy link
Contributor

@tonyandrewmeyer tonyandrewmeyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice simple fix.

Repro'd the issue, and verified the fix locally, including with PEBBLE_DEBUG so I could see the debug log message.

setupObj := st.Cached(execSetupKey{task.ID()})
st.Unlock()
setup, ok := setupObj.(*execSetup)
if !ok || setup == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding things correctly, setupObj will be nil in the case we're interested in, and this will cover that case because nil can't be converted to an execSetup.

Would it be tighter to check if setupObj is nil in case there are other issues with the type check (that should error out)? Or do we want to just log in either case anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are actually four cases:

  1. missing from cache (!ok)
  2. in cache, but not an *execSetup type (!ok)
  3. in cache, but a nil *execSetup value (ok && setup==nil)
  4. in cache and a non-nil *execSetup value (ok && setup!=nil)

So this branch will be executed on 1, 2, or 3. Which I think is fine. Realistically, only 1 will happen because we've only put non-nil *execSetups in the map. So we could handle 2 and 3 more "harshly", but I don't think it hurts to handle them the same way.

@benhoyt benhoyt merged commit 7738d05 into canonical:master Oct 20, 2025
20 checks passed
@benhoyt benhoyt deleted the exec-retries branch October 20, 2025 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

exec tasks continuing after reboot

2 participants