zml/platform: support ROCm allocator configuration via env vars #581
Open
abn wants to merge 1 commit into
Enables configuring OpenXLA allocator settings on ROCm targets and ensures that ZML respects user-defined allocator overrides set via standard environment variables.

- Updates the shared XlaGpu configuration target to support ROCm-specific allocator logic (falling back to bfc if async is requested).
- Updates writeNamedValues to check if standard XLA environment variables are set, and if so, bypasses serialization for those options.
- Passes the is_rocm flag through toNamedValues to handle differences between CUDA and ROCm targets dynamically.
- Adds a unit test to verify allocator configurations, fallbacks, and env var overrides.
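The override and fallback rules in this commit message can be sketched roughly as follows. This is a language-neutral illustration in Python, not ZML's actual Zig code: the function names, the option keys (allocator, preallocate, memory_fraction), the async-to-cuda_async mapping on CUDA, and the exact spelling of the XLA_PYTHON_* variables are assumptions made for the example.

```python
import os

# Assumed variable names: the PR names the XLA_CLIENT_* pair and mentions
# "XLA_PYTHON_*" equivalents without spelling them out.
PREALLOCATE_VARS = ("XLA_CLIENT_PREALLOCATE", "XLA_PYTHON_CLIENT_PREALLOCATE")
MEM_FRACTION_VARS = ("XLA_CLIENT_MEM_FRACTION", "XLA_PYTHON_CLIENT_MEM_FRACTION")


def resolve_allocator(requested, is_rocm):
    """Hypothetical helper: map the requested allocator to one the target supports."""
    if requested == "async":
        # cuda_async is CUDA-only, so a ROCm target falls back to bfc.
        return "bfc" if is_rocm else "cuda_async"
    return requested


def to_named_values(requested, preallocate, memory_fraction, is_rocm, env=None):
    """Hypothetical stand-in for the serialization step described in the PR.

    Options whose environment variable is already set are skipped, so the
    user's environment setting is left to take effect.
    """
    env = os.environ if env is None else env
    values = {"allocator": resolve_allocator(requested, is_rocm)}
    if not any(name in env for name in PREALLOCATE_VARS):
        values["preallocate"] = preallocate
    if not any(name in env for name in MEM_FRACTION_VARS):
        values["memory_fraction"] = memory_fraction
    return values


if __name__ == "__main__":
    # ROCm target, async requested, user has set a preallocation override.
    print(to_named_values("async", True, 0.9, is_rocm=True,
                          env={"XLA_CLIENT_PREALLOCATE": "false"}))
```

Passing the environment as a parameter keeps the sketch easy to exercise in a unit test, which mirrors the kind of override and fallback cases the commit says its test covers.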
This PR enables configuring OpenXLA allocator settings on ROCm targets and ensures that ZML respects user-defined allocator overrides set via standard environment variables.
Previously, the rocm platform target configuration in ZML was a stub (struct {} = .{}, which wrote no PJRT client NamedValue options). On AMD ROCm targets (specifically unified-memory APUs like AMD Strix Halo), the ROCm PJRT driver (libpjrt_rocm.so) falls back to its default behavior of preallocating 90% of available memory. This frequently triggers system-level out-of-memory (OOM) events when running concurrent compilation or serving workloads, even if the user configures lower limits in their environment.

This change:

- Updates writeNamedValues to check if standard XLA environment variables (XLA_CLIENT_PREALLOCATE and XLA_CLIENT_MEM_FRACTION, along with their XLA_PYTHON_* equivalents) are set.
- Falls back to the bfc allocator when async is requested on ROCm (since cuda_async is CUDA-only).
- Adds a unit test, CreateOptions.toNamedValues allocator config, covering all environment variable bypass and fallback cases.

Platform Impact
Verification
Compiled and verified the core ZML tests using:
./bazel.sh test //zml:test