zml/platform: support ROCm allocator configuration via env vars by abn · Pull Request #581 · zml/zml

abn · 2026-06-18T15:24:19Z

This PR enables configuring OpenXLA allocator settings on ROCm targets and ensures that ZML respects user-defined allocator overrides set via standard environment variables.

Previously, the rocm platform target configuration in ZML was a stub (struct {} = .{}, which wrote no PJRT client NamedValue options). On AMD ROCm targets (specifically unified memory APUs like AMD Strix Halo), the ROCm PJRT driver (libpjrt_rocm.so) falls back to its default behavior of preallocating 90% of available memory. This frequently triggers system-level Out-of-Memory (OOM) events when running concurrent compilation or serving workloads, even if the user configures lower limits in their environment.

This change:

- Updates writeNamedValues to check whether the standard XLA environment variables (XLA_CLIENT_PREALLOCATE and XLA_CLIENT_MEM_FRACTION, along with their XLA_PYTHON_* equivalents) are set. If they are, ZML skips serializing those options, so the underlying OpenXLA engine parses and applies the environment values directly.
- Falls back to the bfc allocator when async is requested on ROCm (since cuda_async is CUDA-only).
- Preserves the options configured in code when no environment overrides are present.
- Adds a new unit test, CreateOptions.toNamedValues allocator config, covering all environment variable bypass and fallback cases.
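The decision logic described above can be sketched in a few lines. This is an illustrative Python model, not the Zig code from this PR: the function name `named_values`, its parameters, and the returned dictionary shape are hypothetical, the option keys are assumed to match the PJRT GPU client option names, and the XLA_PYTHON_* spellings are assumed to be the ones JAX documents. It also assumes each option is skipped independently and that a variable counts as set even when its value is empty.

```python
import os

# Variable names from the PR description. The XLA_PYTHON_* spellings are an
# assumption (the description only says "XLA_PYTHON_* equivalents").
PREALLOCATE_VARS = ("XLA_CLIENT_PREALLOCATE", "XLA_PYTHON_CLIENT_PREALLOCATE")
MEM_FRACTION_VARS = ("XLA_CLIENT_MEM_FRACTION", "XLA_PYTHON_CLIENT_MEM_FRACTION")


def any_set(names, env):
    """Return True if at least one variable is present (even if empty)."""
    return any(name in env for name in names)


def named_values(allocator, preallocate, memory_fraction, is_rocm, env=None):
    """Hypothetical model of which allocator options get serialized."""
    env = os.environ if env is None else env

    # cuda_async is CUDA-only, so an async request on ROCm falls back to bfc.
    if is_rocm and allocator == "cuda_async":
        allocator = "bfc"

    values = {"allocator": allocator}

    # Skip an option when the user set its environment variable, so the
    # runtime reads the environment value instead of the code default.
    if not any_set(PREALLOCATE_VARS, env):
        values["preallocate"] = preallocate
    if not any_set(MEM_FRACTION_VARS, env):
        values["memory_fraction"] = memory_fraction
    return values
```

For example, calling `named_values("cuda_async", True, 0.9, is_rocm=True, env={"XLA_CLIENT_MEM_FRACTION": "0.5"})` in this sketch yields the bfc allocator and the preallocate flag, with no memory_fraction entry, leaving that setting to the environment.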

Platform Impact

- ROCm: Lets users override the driver's default preallocation, preventing unified memory OOM crashes under restricted UMA/GTT carve-outs.
- CUDA: Unifies the configuration behavior with ROCm and ensures environment overrides work consistently.

Verification

Compiled and verified the core ZML tests using:

./bazel.sh test //zml:test

Enables configuring OpenXLA allocator settings on ROCm targets and ensures that ZML respects user-defined allocator overrides set via standard environment variables.

- Updates the shared XlaGpu configuration target to support ROCm-specific allocator logic (falling back to bfc if async is requested).
- Updates writeNamedValues to check if standard XLA environment variables are set, and if so, bypasses serialization for those options.
- Passes the is_rocm flag through toNamedValues to handle differences between CUDA and ROCm targets dynamically.
- Adds a unit test to verify allocator configurations, fallbacks, and env var overrides.

abn wants to merge 1 commit into zml:master from abn:rocm-allocator-env
