ggml-cuda: Delta-Net linear attention for Qwen3-Next #18102
base: master
Conversation
JohannesGaessler left a comment
I am in principle willing to review the CUDA code but from a broader ggml perspective the PR cannot be merged like this. Preferably implement your kernel as a fused operation that is entirely contained within the CUDA backend. If this is not possible, support in the CPU backend is mandatory both as a fallback for other backends and to assert that the new ggml op is working correctly in test-backend-ops. (For a fused op new tests in test-backend-ops should also be added.)
cc @am17an (who has recently worked on fusion within the CUDA backend)
Thanks for the feedback. I looked into the fused-op-only approach, but delta-net has recurrent state that persists across calls, similar to Mamba's ssm_scan or RWKV's wkv ops. The state-update semantics are subtle enough that pattern-based fusion would be fragile. Will add a CPU fallback + test-backend-ops tests; should have that up soon.
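For reference, a rough sketch of the gated delta-rule recurrence as described in the DeltaNet / Gated DeltaNet papers (my notation, not taken from this PR's code): the per-head state $S_t$ is carried from token to token, and across decode calls, which is what makes a purely pattern-matched fusion awkward.

$$
S_t = \alpha_t \, S_{t-1}\left(I - \beta_t \, k_t k_t^{\top}\right) + \beta_t \, v_t k_t^{\top},
\qquad o_t = S_t \, q_t
$$

Here $S_t$ is a $d_v \times d_k$ matrix per head, $\alpha_t$ is a decay gate and $\beta_t$ the write strength.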
Added a CPU fallback and test-backend-ops coverage. CUDA passes against the CPU reference:
$ ./test-backend-ops -o DELTA_NET
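Not the PR's code, but roughly the shape a scalar CPU reference for one head could take, assuming the gated delta-rule recurrence sketched above (names, argument layout, and gating convention are all mine):

```cpp
// Hypothetical naive reference: one head, FP32, state S is [d_v][d_k].
// Assumes the gated delta-rule recurrence sketched above; not the PR's actual code.
void delta_net_ref(int n_tokens, int d_k, int d_v,
                   const float * q,     // [n_tokens][d_k]
                   const float * k,     // [n_tokens][d_k], assumed L2-normalized
                   const float * v,     // [n_tokens][d_v]
                   const float * alpha, // [n_tokens] decay gate
                   const float * beta,  // [n_tokens] write strength
                   float * S,           // [d_v][d_k] persistent state (in/out)
                   float * out)         // [n_tokens][d_v]
{
    for (int t = 0; t < n_tokens; ++t) {
        const float * kt = k + t*d_k;
        const float * vt = v + t*d_v;
        for (int i = 0; i < d_v; ++i) {
            // kv = (S * k_t)_i : what the current state predicts for v_t
            float kv = 0.0f;
            for (int j = 0; j < d_k; ++j) kv += S[i*d_k + j] * kt[j];
            // delta rule: decay, remove the old association, write the new one
            for (int j = 0; j < d_k; ++j) {
                S[i*d_k + j] = alpha[t] * (S[i*d_k + j] - beta[t] * kv * kt[j])
                             + beta[t] * vt[i] * kt[j];
            }
        }
        // o_t = S * q_t
        const float * qt = q + t*d_k;
        for (int i = 0; i < d_v; ++i) {
            float o = 0.0f;
            for (int j = 0; j < d_k; ++j) o += S[i*d_k + j] * qt[j];
            out[t*d_v + i] = o;
        }
    }
}
```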
Would be good to ask @ggerganov for his opinion because when I was implementing Qwen3Next he said he didn't want to add custom per-model kernels.
GGML_OP_DELTA_NET isn't a per-model kernel. It's a general linear-attention op, in the same category as GLA, WKV6/7, and SSM_SCAN. Those exist in ggml because they're architectural primitives that can be used by any model implementing that attention mechanism. Happy to hear @ggerganov's take though (and to drop the PR, of course, if it's not feasible for llama.cpp).
I'm not an ML expert, but this does seem to be an operation that appears in multiple models and is "too much" to reconstruct via fusion. I can't comment on the correctness of this definition, but if it's self-contained like this and can replace this whole block in qwen3next, it seems appealing to have as an op. Did you use AI to write the code? It looks like many variants were stamped out that don't need to be (they could just be templated), in a way that AI tends to do. Also the weird deleting of comments...
The kernel variants aren't duplicates. They target different memory hierarchies (global vs. 64 KB shared), data types (FP32 vs. FP16 with half2), and parallelization strategies (single block vs. column-parallel). Templating them together would generate the same code with more complex dispatch. The deleted comments were noise (e.g. "// Apply sigmoid" directly above a ggml_sigmoid() call). Happy to discuss specific consolidations if you see any. Yes, I use AI for scaffolding and iteration; the kernels went through extensive validation and debugging before landing here.
This also seems to be generated by AI.
I see that the code assumes Blackwell has 64 KB per block and has two separate kernels for it, which it doesn't according to the spec (it still returns 48 KB), so I am suspicious of this claim. In general I believe the PR could be useful, but I have low confidence in its claims, and I'm not going to review a 1400-line CUDA kernel. I think it could be consolidated into a single kernel with some templates for data types; even then, the AI responses to questions are off-putting.
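For what it's worth, exceeding the 48 KB static limit is normally done by opting in per kernel rather than relying on a per-arch default; an illustrative sketch (hypothetical names, not the PR's code) of a single dtype-templated kernel plus that opt-in:

```cuda
#include <cuda_fp16.h>

// Illustrative only: one dtype-templated kernel instead of separate FP32/FP16
// copies, plus the explicit opt-in required before a launch can use more than
// 48 KB of dynamic shared memory.
template <typename T>
__global__ void delta_net_recurrent(const T * q, const T * k, const T * v,
                                    float * state, T * out,
                                    int n_tokens, int d_k, int d_v) {
    extern __shared__ char smem_raw[];
    float * S = reinterpret_cast<float *>(smem_raw); // per-head working state tile
    // ... gated delta-rule recurrence over n_tokens, one head per block ...
}

template <typename T>
static void launch_delta_net(const T * q, const T * k, const T * v,
                             float * state, T * out,
                             int n_tokens, int d_k, int d_v,
                             int n_heads, size_t smem_bytes, cudaStream_t stream) {
    if (smem_bytes > 48*1024) {
        // above 48 KB, dynamic shared memory must be requested explicitly,
        // up to cudaDevAttrMaxSharedMemoryPerBlockOptin for the device
        cudaFuncSetAttribute(delta_net_recurrent<T>,
                             cudaFuncAttributeMaxDynamicSharedMemorySize, (int) smem_bytes);
    }
    delta_net_recurrent<T><<<n_heads, 256, smem_bytes, stream>>>(
        q, k, v, state, out, n_tokens, d_k, d_v);
}
```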
Fantastic work! (Of course, I am not familiar with the relevant details; I am just sharing some test results.)
before (master): …
after (this PR): …
Thank you, we'll see what the others think. Starting to regret a little getting convinced to share.
I may have missed some subtleties in the delta net kernels; it's a lot of code and I only skimmed it. But I can say with some confidence that >800 lines of solve_tri is overkill ;-)
I see that this change takes qwen3next from 11185 down to 6001 graph nodes; at first glance that seems really good.
I get mixed performance results: +10% for tg128, -20% for pp512 (CUDA backend, RTX 5090). But the CUDA backend is currently significantly slower than the Vulkan backend, so there's some kind of perf bug, and that makes it hard to judge this change's performance until that issue gets resolved.
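For comparison, the core of a triangular solve is tiny; a textbook forward substitution for L·x = b with L lower-triangular is shown below (illustrative only; batching, tiling, and data-type handling are what the real kernels add on top):

```cpp
// Forward substitution: solve L * x = b for one column, L lower-triangular.
// Hypothetical reference helper, not code from this PR.
static void solve_tri_ref(const float * L, const float * b, float * x, int n) {
    for (int i = 0; i < n; ++i) {
        float s = b[i];
        for (int j = 0; j < i; ++j) {
            s -= L[i*n + j] * x[j];
        }
        x[i] = s / L[i*n + i];
    }
}
```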
CUDA kernel for the Delta-Net linear attention layers in qwen3next.
Adds GGML_OP_DELTA_NET plus a recurrent kernel for decode and a Blackwell path (sm 12.0+) for prefill with 64 KB of shared memory; also improves solve_tri for the chunked prefill path.
Getting ~45-55 t/s on Q4/MXFP4 and ~40 t/s on BF16 with the 80B-A3B model (Blackwell). Pre-Blackwell cards get ~38-40 t/s from the solve_tri improvements (baseline was the original ~20 t/s).
Edit: omitted some small bits.
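The PR doesn't say how these throughput numbers were measured, but figures like these are typically collected with llama-bench; a hypothetical invocation (model path and quant are placeholders):

```sh
./llama-bench -m qwen3-next-80b-a3b-mxfp4.gguf -p 512 -n 128
```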