platforms/neuron: introduce NKI stack with SDPA kernel #528
Open
hugomano wants to merge 1 commit into
Summary
This PR introduces the Neuron NKI stack into ZML and wires it into the LLM attention path.
The main goal is to make Neuron attention run through NKI kernels.
It adds the compiler/runtime plumbing needed to embed NKI kernels as Neuron custom-native calls, then uses that path for SDPA-style prefill/decode attention. Paged Attention will come later.
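For readers unfamiliar with the term, SDPA (scaled dot-product attention) computes softmax(QK^T / sqrt(d)) V. The snippet below is a minimal single-head reference in plain Python, not the NKI kernel from this PR. The function name `sdpa`, the nested-list layout, and the causal-offset convention are assumptions made for illustration, and it assumes there are at least as many keys as queries.

```python
import math


def sdpa(q, k, v, causal=False):
    """Reference scaled dot-product attention for a single head.

    q: [Tq][d], k: [Tk][d], v: [Tk][dv], as nested lists of floats.
    Assumes Tk >= Tq (illustrative convention, not taken from the PR).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    # Align the last query with the last key, so prefill (Tq == Tk) gets a
    # lower-triangular mask and decode (Tq == 1) attends to the whole cache.
    offset = len(k) - len(q)
    out = []
    for i, q_row in enumerate(q):
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        if causal:
            scores = [s if j <= i + offset else -math.inf
                      for j, s in enumerate(scores)]
        # Numerically stable softmax: subtract the row maximum first.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v_row[c] for w, v_row in zip(weights, v)) / total
            for c in range(len(v[0]))
        ])
    return out
```

With this convention the same function covers both shapes mentioned above: prefill passes equally long query and key sequences with `causal=True`, while decode passes a single query row against all cached keys.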
On inf2.8xlarge it runs:
Neuron NKI compiler integration
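As rough orientation for the compile flow in this section, the sketch below models the round trip in Python: build a JSON compile request, hand it to a launcher, and attach the returned base64 string to the custom call. It is not the PR's code. The request fields (`source`, `entrypoint`, `target`), the `fake_launcher` stand-in, and the example path and target string are all hypothetical; the real bridge invokes the sandboxed launcher as a separate process and its schema is not shown here.

```python
import base64
import json


def build_compile_request(kernel_path, entrypoint, target):
    """Serialize a compile request. Field names are hypothetical."""
    return json.dumps(
        {"source": kernel_path, "entrypoint": entrypoint, "target": target},
        sort_keys=True,
    )


def fake_launcher(request_json):
    """Stand-in for the sandboxed compiler launcher.

    The real launcher would compile the kernel; this one only echoes part
    of the request back as a base64-encoded payload.
    """
    request = json.loads(request_json)
    payload = json.dumps(
        {"compiled_for": request["target"], "kernel": request["entrypoint"]}
    ).encode("utf-8")
    return base64.b64encode(payload).decode("ascii")


def custom_call_attrs(backend_config_b64):
    """Build the attributes attached to the StableHLO custom call."""
    # Fail early if the launcher returned something that is not base64.
    base64.b64decode(backend_config_b64, validate=True)
    return {
        "call_target_name": "AwsNeuronCustomNativeKernel",
        "backend_config": backend_config_b64,
    }


# Hypothetical kernel path, entrypoint, and target, for illustration only.
request = build_compile_request("kernels/sdpa.py", "sdpa", "inf2")
attrs = custom_call_attrs(fake_launcher(request))
```

Keeping the backend config as an opaque base64 string means the graph builder never has to interpret the compiler output; it only validates the encoding and passes it through.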
Adds a new zml.ops.neuronNki(...) API that lowers a Zig tensor operation to a StableHLO custom call with:

- call_target_name = "AwsNeuronCustomNativeKernel"
- backend_config attached to the custom call

The compiler bridge lives in:

- platforms/neuron/nki_kernel.zig
- platforms/neuron/nki_kernel_compiler.py
- platforms/neuron/nki_compiler_launcher.zig
- platforms/neuron/python_launcher.zig

At graph construction time, ZML resolves the NKI source file from Bazel runfiles, writes a JSON compile request, invokes the sandboxed nki-cc launcher, and reads back the base64 backend config expected by Neuron’s custom-native kernel path.

The compiler target is inferred from the actual Neuron instance through nrt_get_instance_info, mapping supported families to targets.

Neuron runtime
The Neuron platform packaging is updated.

The previous setup carried a local libneuronxla.zig shim and initialized Python directly from the PJRT proxy so ZML could expose the compiler pieces expected by Neuron. With libneuronxla v0.3, the AWS package now owns more of that compiler/runtime surface without depending on Python, so the sandbox is rebuilt around the packaged libneuronxla payload instead of ZML’s local replacement.

- nki>=0.4.0,<0.5
- aws-neuronx-tools, libgcc_s1, and libstdcpp6, as required by the .so library
- neuronx-cc and nki-cc binaries
- NEURON_CC_FLAGS, sandbox PATH, etc.

NKI attention backend
Adds a new attention backend:
- zml.attention.attention.Backend.nki
- .neuron
- examples/llm --backend=nki

StableHLO/lowering compatibility fixes for Neuron
I added shims in a few Neuron-sensitive lowering paths:
These shims are isolated in zml.ops.LoweringCompatibility so model code can continue expressing normal StableHLO semantics. The gather/scatter sentinel handling is related to upstream Neuron issue aws-neuron/aws-neuron-sdk#1335.

For topk, a specific branch has been added until the Neuron team fixes aws-neuron/aws-neuron-sdk#1339. I also asked for an in-graph data movement API: aws-neuron/aws-neuron-sdk#1340.

Neuron profiling workflow
Adds a repo-level Neuron profiling workflow with documentation:
- --config=neuron-profile
- //tools/neuron:profile
- //tools/neuron:server
- //tools/neuron:ingest
- //tools/neuron:summary-json
- //tools/neuron:summary-txt
- //tools/neuron:summary-perfetto

Known follow-ups
Related doc