zml/platform: support ROCm allocator configuration via env vars #581

Open
abn wants to merge 1 commit into zml:master from abn:rocm-allocator-env

abn commented Jun 18, 2026

This PR enables configuring OpenXLA allocator settings on ROCm targets and ensures that ZML respects user-defined allocator overrides set via standard environment variables.

Previously, the rocm platform target configuration in ZML was a stub (struct {} = .{}), so it wrote no PJRT client NamedValue options. As a result, on AMD ROCm targets (specifically unified-memory APUs such as AMD Strix Halo), the ROCm PJRT driver (libpjrt_rocm.so) falls back to its default behavior of preallocating 90% of available memory. This frequently triggers system-level Out-of-Memory (OOM) events when running concurrent compilation or serving workloads, even if the user configures lower limits in their environment.

This change:

  • Updates writeNamedValues to check if standard XLA environment variables (XLA_CLIENT_PREALLOCATE and XLA_CLIENT_MEM_FRACTION, along with their XLA_PYTHON_* equivalents) are set.
  • Skips serialization of those options when the variables are set, so that the underlying OpenXLA runtime parses and applies the environment variable values directly.
  • Falls back to the bfc allocator when async is requested on ROCm (since cuda_async is CUDA-only).
  • Preserves configured code options when no environment overrides are present.
  • Adds a new unit test, "CreateOptions.toNamedValues allocator config", covering all environment variable bypass and fallback cases.
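
The decision logic in the list above can be sketched roughly as follows. This is an illustrative Python model, not the Zig code in this PR: `AllocatorConfig` and `to_named_values` are hypothetical names, the option keys (`allocator`, `preallocate`, `memory_fraction`) are assumed to match what the PJRT GPU client accepts, and the `XLA_PYTHON_CLIENT_*` spellings are assumed to be the "XLA_PYTHON_*" equivalents mentioned above.

```python
from dataclasses import dataclass

# XLA_CLIENT_* names come from the PR text; the XLA_PYTHON_CLIENT_* spellings
# are an assumption about what the "XLA_PYTHON_*" equivalents are called.
PREALLOCATE_VARS = ("XLA_CLIENT_PREALLOCATE", "XLA_PYTHON_CLIENT_PREALLOCATE")
MEM_FRACTION_VARS = ("XLA_CLIENT_MEM_FRACTION", "XLA_PYTHON_CLIENT_MEM_FRACTION")


@dataclass
class AllocatorConfig:
    """Hypothetical stand-in for the allocator part of ZML's CreateOptions."""
    kind: str               # "bfc" or "async"
    preallocate: bool
    memory_fraction: float


def to_named_values(cfg, is_rocm, env):
    """Return the options that would be serialized for the PJRT client.

    Option keys are assumed names; `env` is any mapping of environment
    variables (for example os.environ).
    """
    allocator = cfg.kind
    if allocator == "async":
        # cuda_async is CUDA-only, so ROCm falls back to bfc.
        allocator = "bfc" if is_rocm else "cuda_async"

    values = {"allocator": allocator}

    # Skip an option when the user set the matching environment variable,
    # leaving the runtime to read the variable itself.
    if not any(name in env for name in PREALLOCATE_VARS):
        values["preallocate"] = cfg.preallocate
    if not any(name in env for name in MEM_FRACTION_VARS):
        values["memory_fraction"] = cfg.memory_fraction
    return values


if __name__ == "__main__":
    cfg = AllocatorConfig(kind="async", preallocate=False, memory_fraction=0.5)
    env = {"XLA_CLIENT_MEM_FRACTION": "0.3"}
    print(to_named_values(cfg, is_rocm=True, env=env))
    # prints {'allocator': 'bfc', 'preallocate': False}
```

With no override variables in `env`, both configured values are emitted unchanged, which corresponds to the "preserves configured code options" case.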

Platform Impact

  • ROCm: Lets users override the driver's default preallocation, preventing unified memory OOM crashes under restricted UMA/GTT carve-outs.
  • CUDA: Unifies the configuration behavior with ROCm and ensures environment overrides work consistently.

Verification

Compiled and verified the core ZML tests using:

./bazel.sh test //zml:test

Enables configuring OpenXLA allocator settings on ROCm targets and
ensures that ZML respects user-defined allocator overrides set via
standard environment variables.

- Updates the shared XlaGpu configuration target to support ROCm-specific allocator logic (falling back to bfc if async is requested).
- Updates writeNamedValues to check if standard XLA environment variables
  are set, and if so, bypasses serialization for those options.
- Passes the is_rocm flag through toNamedValues to handle differences
  between CUDA and ROCm targets dynamically.
- Adds a unit test to verify allocator configurations, fallbacks, and env var overrides.