Skip to content

metal : graceful failure instead of GGML_ABORT#1527

Open
fiorelorenzo wants to merge 1 commit into
ggml-org:masterfrom
fiorelorenzo:fix/metal-graceful-failure
Open

metal : graceful failure instead of GGML_ABORT#1527
fiorelorenzo wants to merge 1 commit into
ggml-org:masterfrom
fiorelorenzo:fix/metal-graceful-failure

Conversation

@fiorelorenzo

Copy link
Copy Markdown

A failed Metal command buffer (device lost, OOM, GPU hang) currently calls GGML_ABORT inside ggml_metal_synchronize, which terminates the whole process. An application embedding ggml-metal has no way to catch this and recover.

Changes:

  • ggml_metal_synchronize: replace the two GGML_ABORT with a synchronize_failed flag and an early return (releasing the already-checked external command buffers first).
  • ggml_metal_graph_compute: check the flag at entry, and before returning success, synchronize and check its own command buffers, so a failure is reported as GGML_STATUS_FAILED for that compute rather than only the next one (which never runs for the final graph). The existing graph-compute callers already check != GGML_STATUS_SUCCESS, so they get a clean error instead of the process aborting. Callers read results right after each compute, so the added synchronize costs little.
  • ggml_metal_rsets_free: tolerate resource sets still alive at process teardown (warn and free the struct) instead of asserting, which otherwise use-after-frees the leaked backends on exit.

Testing: built ggml-metal with -DGGML_METAL=ON on macOS (Apple Silicon). The change converts an abort into a status code that the existing callers already check; I did not synthetically inject a command buffer error.

AI usage: I wrote the synchronize_failed flag, the GGML_ABORT removals, and the rsets_free teardown change. I used an AI assistant to add the end-of-compute synchronize check and to help word this description.

A failed Metal command buffer (device lost, OOM, hang) currently calls
GGML_ABORT in ggml_metal_synchronize and kills the whole process, so an
embedding application cannot catch it and recover.

Replace the two GGML_ABORT in ggml_metal_synchronize with a
synchronize_failed flag and an early return (releasing the already-checked
external command buffers first). ggml_metal_graph_compute checks the flag at
entry and also synchronizes + checks its own command buffers before returning
success, so a failure is reported as GGML_STATUS_FAILED for that compute, not
only the next one (which never runs for the final graph). Callers read results
right after each compute, so the added synchronize costs little.

ggml_metal_rsets_free tolerates resource sets still alive at process teardown:
warn and free the struct instead of asserting, which otherwise use-after-frees
the leaked backends on exit.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant