Add strategies to deal with workflow step failures

### Description

Currently, when a workflow step fails, we keep it in the state table so it is picked up when the workflow task runs again. This auto-retry mechanism can leave the workflow stuck on that step indefinitely, until the admin either fixes the step (if possible) or cancels the execution of the workflow for that resource.

One of the shortcomings of this approach is that there's no update to the step, so when we fetch the active workflows for a resource we don't see it as failed, we only see that it is scheduled to run. In other words, there's no differentiation between a regularly scheduled step and a failed one.

We should introduce a configuration option to handle failures using different strategies:

- retry: same behavior as we have today, but we should introduce a max-retries so that the step doesn't run indefinitely
- skip: log that the step failed and skip it, moving the workflow to the next step automatically
- cancel: log the failure and cancel the workflow due to the failed step

This could be a config for the whole workflow (something like `on-failure: retry=5` defined at the workflow level), but also a config at the step level, allowing specific steps to use a different strategy.

```
name: my workflow
...
on-failure: retry=5                  # retry at most 5 times, after that the step is not picked up to run again
steps:
  - uses: my-step
    with:
      on-failure: skip                  # if this step fails, it is ok to skip it
  - uses: another-step            # this step uses the same strategy defined for the workflow
  ....
```
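To make the precedence concrete, here is a minimal sketch of how an `on-failure` value could be parsed and resolved, with the step-level setting overriding the workflow-level one. The names `FailurePolicy`, `parse_on_failure`, and `resolve_policy` are hypothetical and not part of any existing code; treating a bare `retry` as unbounded is also an assumption, chosen to mirror today's behavior.

```python
from dataclasses import dataclass
from typing import Optional

STRATEGIES = ("retry", "skip", "cancel")


@dataclass(frozen=True)
class FailurePolicy:
    """Hypothetical in-memory form of an `on-failure` value."""
    strategy: str                      # one of STRATEGIES
    max_retries: Optional[int] = None  # only meaningful for "retry"; None = unbounded


def parse_on_failure(value: str) -> FailurePolicy:
    """Parse strings such as 'retry=5', 'retry', 'skip', or 'cancel'."""
    name, sep, arg = value.strip().partition("=")
    name = name.strip().lower()
    if name not in STRATEGIES:
        raise ValueError(f"unknown on-failure strategy: {name!r}")
    if name == "retry":
        # Assumption: 'retry' without a count keeps the current unbounded behavior.
        return FailurePolicy("retry", int(arg) if sep else None)
    if sep:
        raise ValueError(f"strategy {name!r} does not take an argument")
    return FailurePolicy(name)


def resolve_policy(workflow_value: Optional[str],
                   step_value: Optional[str]) -> FailurePolicy:
    """Step-level config wins; otherwise fall back to the workflow, then to retry."""
    if step_value is not None:
        return parse_on_failure(step_value)
    if workflow_value is not None:
        return parse_on_failure(workflow_value)
    return FailurePolicy("retry")
```

With the example workflow above, `resolve_policy("retry=5", "skip")` would give the `skip` policy for `my-step`, while `resolve_policy("retry=5", None)` would give `another-step` the workflow-level retry policy with a limit of 5.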

In the retry strategy we would need to keep track of failed steps, so the state table should have proper columns for that (`status: FAILED`, `failure-reason: error message caught by the code`, `retry-count: 0`). The API would need some adjustments as well, allowing admins to fetch all currently failing steps, analyse the problems, and then run some actions on these steps (cancel the workflow execution, retry the step manually, skip the step, or even migrate the resource to another step/workflow).
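The bookkeeping described above could look roughly like the following sketch. It is not the actual state table or API: `StepState`, `Outcome`, `record_failure`, and `failing_steps` are hypothetical names, and the choice to count only re-runs (so `retry=5` allows the initial run plus five retries) is an assumption about how `max-retries` would be interpreted.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class StepStatus(Enum):
    SCHEDULED = "SCHEDULED"
    FAILED = "FAILED"


class Outcome(Enum):
    RETRY_NEXT_RUN = "retry on the next workflow task run"
    SKIP_TO_NEXT_STEP = "skip and move to the next step"
    CANCEL_WORKFLOW = "cancel the workflow for this resource"
    RETRIES_EXHAUSTED = "stop picking the step up"


@dataclass
class StepState:
    """Hypothetical row of the state table with the proposed columns."""
    resource_id: str
    step: str
    status: StepStatus = StepStatus.SCHEDULED
    failure_reason: Optional[str] = None
    retry_count: int = 0


def record_failure(state: StepState, strategy: str,
                   max_retries: Optional[int], error: str) -> Outcome:
    """Mark the step as failed and decide what happens next."""
    state.status = StepStatus.FAILED
    state.failure_reason = error
    if strategy == "skip":
        return Outcome.SKIP_TO_NEXT_STEP
    if strategy == "cancel":
        return Outcome.CANCEL_WORKFLOW
    # Retry strategy: stop once the configured number of retries is used up.
    if max_retries is not None and state.retry_count >= max_retries:
        return Outcome.RETRIES_EXHAUSTED
    state.retry_count += 1
    return Outcome.RETRY_NEXT_RUN


def failing_steps(states: Iterable[StepState]) -> List[StepState]:
    """What an admin-facing 'list failing steps' call could return."""
    return [s for s in states if s.status is StepStatus.FAILED]
```

Because the status and failure reason are written on every failure, a failed step is distinguishable from a regularly scheduled one even while it is still waiting for its next retry.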

### Value Proposition

This enhancement would take the automatic error handling in workflows to another level, allowing admins to define when it is safe to skip a step, when it should be retried, and how many times it should be retried. It would also allow admins to easily fetch and visualize failing steps and act on them.

### Goals

Offer admins choices when it comes to handling workflow step failures automatically. Retrying a step will be the default behavior to keep it consistent with the current implementation where a failing step is kept in the state table so that it is re-run the next time the workflow task runs.

### Non-Goals

Retry with back-off at the executor level is out of scope. That is, we aim at improving the error handling at the step execution level, and retrying a step means keeping it in the state table so the step is executed again when the workflow task runs. We don't aim at retrying failing steps in the same execution with this enhancement.

We might at some point add some retry with back-off logic at the step level before determining that the step failed, but that will be a separate enhancement that needs to take the executor timeout into consideration.

### Discussion

_No response_

### Notes

_No response_

Add strategies to deal with workflow step failures #46317
