metal : graceful failure instead of GGML_ABORT by fiorelorenzo · Pull Request #1527 · ggml-org/ggml

fiorelorenzo · 2026-06-03T10:49:07Z

A failed Metal command buffer (device lost, OOM, GPU hang) currently calls GGML_ABORT inside ggml_metal_synchronize, which terminates the whole process. An application embedding ggml-metal has no way to catch this and recover.

Changes:

ggml_metal_synchronize: replace the two GGML_ABORT with a synchronize_failed flag and an early return (releasing the already-checked external command buffers first).
ggml_metal_graph_compute: check the flag at entry, and before returning success, synchronize and check its own command buffers, so a failure is reported as GGML_STATUS_FAILED for that compute rather than only the next one (which never runs for the final graph). The existing graph-compute callers already check != GGML_STATUS_SUCCESS, so they get a clean error instead of the process aborting. Callers read results right after each compute, so the added synchronize costs little.
ggml_metal_rsets_free: tolerate resource sets still alive at process teardown (warn and free the struct) instead of asserting, which otherwise use-after-frees the leaked backends on exit.

Testing: built ggml-metal with -DGGML_METAL=ON on macOS (Apple Silicon). The change converts an abort into a status code that the existing callers already check; I did not synthetically inject a command buffer error.

AI usage: I wrote the synchronize_failed flag, the GGML_ABORT removals, and the rsets_free teardown change. I used an AI assistant to add the end-of-compute synchronize check and to help word this description.

A failed Metal command buffer (device lost, OOM, hang) currently calls GGML_ABORT in ggml_metal_synchronize and kills the whole process, so an embedding application cannot catch it and recover. Replace the two GGML_ABORT in ggml_metal_synchronize with a synchronize_failed flag and an early return (releasing the already-checked external command buffers first). ggml_metal_graph_compute checks the flag at entry and also synchronizes + checks its own command buffers before returning success, so a failure is reported as GGML_STATUS_FAILED for that compute, not only the next one (which never runs for the final graph). Callers read results right after each compute, so the added synchronize costs little. ggml_metal_rsets_free tolerates resource sets still alive at process teardown: warn and free the struct instead of asserting, which otherwise use-after-frees the leaked backends on exit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metal : graceful failure instead of GGML_ABORT#1527

metal : graceful failure instead of GGML_ABORT#1527
fiorelorenzo wants to merge 1 commit into
ggml-org:masterfrom
fiorelorenzo:fix/metal-graceful-failure

fiorelorenzo commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

fiorelorenzo commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant