[Killer app] Implement DeepSeek v4 Flash for SM120

DeepSeek V4 Flash currently seems to be the best model that fits in 192GB of VRAM (though I never managed to test MiMo V2.5, because I had to limit context to 90K even though 200K should be possible in theory).

However, support for it is currently stuck in endless churn in vLLM:
- https://github.com/vllm-project/vllm/pull/41834
- https://github.com/vllm-project/vllm/issues/40902 (no plan for GPUs before Hopper)
- https://github.com/vllm-project/vllm/pull/43477

Kernels and forks all over the place:
- https://github.com/aidendle94/vllm/tree/DSV4
- https://github.com/aidendle94/flashinfer

And DeepSeek doesn't plan to support SM120:
- https://github.com/deepseek-ai/DeepGEMM
- https://github.com/deepseek-ai/FlashMLA
- https://github.com/deepseek-ai/TileKernels

And FlashAttention has no MLA kernel and no plan to add one soon:
- https://github.com/Dao-AILab/flash-attention/issues/1987

This requires:
- Mixed-precision FP8/MXFP4 support
- MLA kernel (Multi-head Latent Attention)
- DSA kernel (DeepSeek Sparse Attention)
- NSA kernel / Lightning Indexer
- mHC kernel (Manifold-Constrained Hyper-Connections)
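
To make the MXFP4 requirement concrete, here is a minimal reference sketch of dequantizing one MXFP4 block in plain Python. It follows the OCP Microscaling layout as I understand it (32 E2M1 elements sharing one E8M0 power-of-two scale). The function names are hypothetical, and the nibble order (low nibble first) is an assumption, not something taken from vLLM, FlashInfer or the DeepSeek kernels. A real SM120 kernel would fuse this into the GEMM rather than materialize floats.

```python
# E2M1 magnitudes indexed by the low 3 bits of a 4-bit code.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32  # number of elements that share one scale in MXFP4


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def decode_e8m0(scale_byte: int) -> float:
    """Decode the shared E8M0 scale: a power of two with bias 127 (0xFF is NaN)."""
    if scale_byte == 0xFF:
        return float("nan")
    return 2.0 ** (scale_byte - 127)


def dequantize_mxfp4_block(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize one 32-element block stored as 16 packed bytes plus one scale byte.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    if len(packed) != BLOCK_SIZE // 2:
        raise ValueError(f"expected {BLOCK_SIZE // 2} bytes, got {len(packed)}")
    scale = decode_e8m0(scale_byte)
    values = []
    for byte in packed:
        values.append(decode_e2m1(byte & 0xF) * scale)  # low nibble first
        values.append(decode_e2m1(byte >> 4) * scale)   # then high nibble
    return values
```

A reference like this is mainly useful as a correctness oracle when comparing the output of a hand-written kernel against known-good values.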
