metal : graceful failure instead of GGML_ABORT#1527
Open
fiorelorenzo wants to merge 1 commit into
Open
Conversation
A failed Metal command buffer (device lost, OOM, hang) currently calls GGML_ABORT in ggml_metal_synchronize and kills the whole process, so an embedding application cannot catch it and recover. Replace the two GGML_ABORT in ggml_metal_synchronize with a synchronize_failed flag and an early return (releasing the already-checked external command buffers first). ggml_metal_graph_compute checks the flag at entry and also synchronizes + checks its own command buffers before returning success, so a failure is reported as GGML_STATUS_FAILED for that compute, not only the next one (which never runs for the final graph). Callers read results right after each compute, so the added synchronize costs little. ggml_metal_rsets_free tolerates resource sets still alive at process teardown: warn and free the struct instead of asserting, which otherwise use-after-frees the leaked backends on exit.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
A failed Metal command buffer (device lost, OOM, GPU hang) currently calls GGML_ABORT inside ggml_metal_synchronize, which terminates the whole process. An application embedding ggml-metal has no way to catch this and recover.
Changes:
!= GGML_STATUS_SUCCESS, so they get a clean error instead of the process aborting. Callers read results right after each compute, so the added synchronize costs little.Testing: built ggml-metal with -DGGML_METAL=ON on macOS (Apple Silicon). The change converts an abort into a status code that the existing callers already check; I did not synthetically inject a command buffer error.
AI usage: I wrote the synchronize_failed flag, the GGML_ABORT removals, and the rsets_free teardown change. I used an AI assistant to add the end-of-compute synchronize check and to help word this description.