ggml-cuda: Delta-Net linear attention for Qwen3-Next #18102
base: master
Conversation
JohannesGaessler left a comment
I am in principle willing to review the CUDA code but from a broader ggml perspective the PR cannot be merged like this. Preferably implement your kernel as a fused operation that is entirely contained within the CUDA backend. If this is not possible, support in the CPU backend is mandatory both as a fallback for other backends and to assert that the new ggml op is working correctly in test-backend-ops. (For a fused op new tests in test-backend-ops should also be added.)
cc @am17an (who has recently worked on fusion within the CUDA backend)
Thanks for the feedback. I looked into the fused-op-only approach, but delta-net has recurrent state that persists across calls, similar to Mamba's ssm_scan or RWKV's wkv ops. The state-update semantics are subtle enough that pattern-based fusion would be fragile. Will add a CPU fallback + test-backend-ops tests; should have that up soon.
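For reference, a rough sketch of the gated delta-rule recurrence as described in the DeltaNet / Gated DeltaNet papers (my notation, not taken from this PR's code): the per-head state $S_t$ is carried from token to token, and across decode calls, which is what makes a purely pattern-matched fusion awkward.

$$
S_t = \alpha_t \, S_{t-1}\left(I - \beta_t \, k_t k_t^{\top}\right) + \beta_t \, v_t k_t^{\top},
\qquad o_t = S_t \, q_t
$$

Here $S_t$ is a $d_v \times d_k$ matrix per head, $\alpha_t$ is a decay gate and $\beta_t$ the write strength.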
Added a CPU fallback and test-backend-ops coverage. CUDA passes against the CPU reference:
$ ./test-backend-ops -o DELTA_NET
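Not the PR's code, but roughly the shape a scalar CPU reference for one head could take, assuming the gated delta-rule recurrence sketched above (names, argument layout, and gating convention are all mine):

```cpp
// Hypothetical naive reference: one head, FP32, state S is [d_v][d_k].
// Assumes the gated delta-rule recurrence sketched above; not the PR's actual code.
void delta_net_ref(int n_tokens, int d_k, int d_v,
                   const float * q,     // [n_tokens][d_k]
                   const float * k,     // [n_tokens][d_k], assumed L2-normalized
                   const float * v,     // [n_tokens][d_v]
                   const float * alpha, // [n_tokens] decay gate
                   const float * beta,  // [n_tokens] write strength
                   float * S,           // [d_v][d_k] persistent state (in/out)
                   float * out)         // [n_tokens][d_v]
{
    for (int t = 0; t < n_tokens; ++t) {
        const float * kt = k + t*d_k;
        const float * vt = v + t*d_v;
        for (int i = 0; i < d_v; ++i) {
            // kv = (S * k_t)_i : what the current state predicts for v_t
            float kv = 0.0f;
            for (int j = 0; j < d_k; ++j) kv += S[i*d_k + j] * kt[j];
            // delta rule: decay, remove the old association, write the new one
            for (int j = 0; j < d_k; ++j) {
                S[i*d_k + j] = alpha[t] * (S[i*d_k + j] - beta[t] * kv * kt[j])
                             + beta[t] * vt[i] * kt[j];
            }
        }
        // o_t = S * q_t
        const float * qt = q + t*d_k;
        for (int i = 0; i < d_v; ++i) {
            float o = 0.0f;
            for (int j = 0; j < d_k; ++j) o += S[i*d_k + j] * qt[j];
            out[t*d_v + i] = o;
        }
    }
}
```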
Would be good to ask @ggerganov for his opinion because when I was implementing Qwen3Next he said he didn't want to add custom per-model kernels.
GGML_OP_DELTA_NET isn't a per-model kernel. It's a general linear-attention op, in the same category as GLA, WKV6/7, and SSM_SCAN. Those exist in ggml because they're architectural primitives that can be used by any model implementing that attention mechanism. Happy to hear @ggerganov's take though (and to drop the PR, of course, if it's not feasible for llama.cpp).
I'm not an ML expert, but this does seem to be an operation that appears in multiple models and is "too much" to reconstruct via fusion. I can't comment on the correctness of this definition, but if it's self-contained like this and can replace this whole block in qwen3next, it seems appealing to have as an op. Did you use AI to write the code? It looks like many variants were stamped out that don't need to be (they could just be templated), in a way that AI tends to do. Also the weird deleting of comments...
The kernel variants aren't duplicates. They target different memory hierarchies (global vs. 64 KB shared), data types (FP32 vs. FP16 with half2), and parallelization strategies (single block vs. column-parallel). Templating them together would generate the same code with more complex dispatch. The deleted comments were noise (e.g. "// Apply sigmoid" directly above a ggml_sigmoid() call). Happy to discuss specific consolidations if you see any. Yes, I use AI for scaffolding and iteration; the kernels went through extensive validation and debugging before landing here.
This also seems to be generated by AI.
I see that the code assumes Blackwell has 64 KB per block and has two separate kernels for it, which it doesn't according to the spec (it still returns 48 KB), so I am suspicious of this claim. In general I believe the PR could be useful, but I have low confidence in its claims, and I'm not going to review a 1400-line CUDA kernel. I think it could be consolidated into a single kernel with some templates for data types; even then, the AI responses to questions are off-putting.
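For what it's worth, exceeding the 48 KB static limit is normally done by opting in per kernel rather than relying on a per-arch default; an illustrative sketch (hypothetical names, not the PR's code) of a single dtype-templated kernel plus that opt-in:

```cuda
#include <cuda_fp16.h>

// Illustrative only: one dtype-templated kernel instead of separate FP32/FP16
// copies, plus the explicit opt-in required before a launch can use more than
// 48 KB of dynamic shared memory.
template <typename T>
__global__ void delta_net_recurrent(const T * q, const T * k, const T * v,
                                    float * state, T * out,
                                    int n_tokens, int d_k, int d_v) {
    extern __shared__ char smem_raw[];
    float * S = reinterpret_cast<float *>(smem_raw); // per-head working state tile
    // ... gated delta-rule recurrence over n_tokens, one head per block ...
}

template <typename T>
static void launch_delta_net(const T * q, const T * k, const T * v,
                             float * state, T * out,
                             int n_tokens, int d_k, int d_v,
                             int n_heads, size_t smem_bytes, cudaStream_t stream) {
    if (smem_bytes > 48*1024) {
        // above 48 KB, dynamic shared memory must be requested explicitly,
        // up to cudaDevAttrMaxSharedMemoryPerBlockOptin for the device
        cudaFuncSetAttribute(delta_net_recurrent<T>,
                             cudaFuncAttributeMaxDynamicSharedMemorySize, (int) smem_bytes);
    }
    delta_net_recurrent<T><<<n_heads, 256, smem_bytes, stream>>>(
        q, k, v, state, out, n_tokens, d_k, d_v);
}
```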
Fantastic work! (Of course, I am not familiar with the relevant details; I am just sharing some test results.)
before (master): …
after (this PR): …
Thank you, we'll see what the others think. Starting to regret a little getting convinced to share.
I may have missed some subtleties in the delta net kernels; it's a lot of code and I only skimmed it. But I can say with some confidence that >800 lines of solve_tri is overkill ;-)
I see that this change takes qwen3next from 11185 down to 6001 graph nodes; at first glance that seems really good.
I get mixed performance results: +10% for tg128, -20% for pp512 (CUDA backend, RTX 5090). But the CUDA backend is currently significantly slower than the Vulkan backend, so there's some kind of perf bug, and that makes it hard to judge this change's performance until that issue gets resolved.
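For comparison, the core of a triangular solve is tiny; a textbook forward substitution for L·x = b with L lower-triangular is shown below (illustrative only; batching, tiling, and data-type handling are what the real kernels add on top):

```cpp
// Forward substitution: solve L * x = b for one column, L lower-triangular.
// Hypothetical reference helper, not code from this PR.
static void solve_tri_ref(const float * L, const float * b, float * x, int n) {
    for (int i = 0; i < n; ++i) {
        float s = b[i];
        for (int j = 0; j < i; ++j) {
            s -= L[i*n + j] * x[j];
        }
        x[i] = s / L[i*n + i];
    }
}
```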
CUDA kernel for the Delta-Net linear attention layers in qwen3next.
Adds GGML_OP_DELTA_NET plus a recurrent kernel for decode and a Blackwell path (sm 12.0+) for prefill with 64 KB of shared memory; also improves solve_tri for the chunked prefill path.
Getting ~45-55 t/s on Q4/MXFP4 and ~40 t/s on BF16 with the 80B-A3B model (Blackwell). Pre-Blackwell cards get ~38-40 t/s from the solve_tri improvements (baseline was the original ~20 t/s).
Edit: omitted some small bits.
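The PR doesn't say how these throughput numbers were measured, but figures like these are typically collected with llama-bench; a hypothetical invocation (model path and quant are placeholders):

```sh
./llama-bench -m qwen3-next-80b-a3b-mxfp4.gguf -p 512 -n 128
```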