@sazczmh sazczmh commented Apr 8, 2025

By reusing the Tensor Core accumulator registers to implement a 256x128 BlockTile structure, this approach significantly increases data reuse, reduces the demand for L2 cache and HBM accesses, and raises the SM's computational frequency, ultimately achieving FP8 performance exceeding 1,500 TFLOPS.
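As a rough illustration of the data-reuse claim, here is a back-of-the-envelope sketch, not the kernel's actual accounting. It assumes each A and B element is loaded once per output tile per K step, that FP8 operands take 1 byte, and it ignores scaling factors and output writes. The helper name `tile_intensity` is hypothetical.

```python
def tile_intensity(bm: int, bn: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per byte of A/B operand traffic for one BM x BN output tile.

    Simplified model (an assumption, not measured): per unit of K the tile
    reads bm elements of A and bn elements of B, and performs 2 * bm * bn
    FLOPs (one multiply and one add per output element).
    """
    flops = 2 * bm * bn
    bytes_loaded = (bm + bn) * bytes_per_elem
    return flops / bytes_loaded


base = tile_intensity(128, 160)  # baseline BlockTile from the table
opti = tile_intensity(256, 128)  # BlockTile introduced by this PR

print(f"128x160: {base:.1f} FLOP/byte")
print(f"256x128: {opti:.1f} FLOP/byte")
print(f"ratio:   {opti / base:.2f}x")
```

Under this model the 256x128 tile does about 1.2x as many FLOPs per operand byte as the 128x160 tile, which is consistent with the reduced pressure on L2 cache and HBM described above.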

| M | N | K | Base BMxBN | Base computation (TFLOPS) | Opti BMxBN | Opti computation (TFLOPS) | Speedup |
|---|---|---|---|---|---|---|---|
| 4096 | 24576 | 1536 | 128x160 | 1162 | 256x128 | 1204 | 3.61% |
| 4096 | 32768 | 512 | 128x160 | 801 | 256x128 | 777 | -3.00% |
| 4096 | 7168 | 16384 | 128x160 | 1451 | 256x128 | 1500 | 3.38% |
| 4096 | 4096 | 7168 | 128x160 | 1304 | 256x128 | 1377 | 5.60% |
| 4096 | 7168 | 2048 | 128x160 | 1185 | 256x128 | 1159 | -2.19% |

Tested on H800 SXM with CUDA 12.8.1.
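The Speedup column can be reproduced from the two throughput columns. The sketch below assumes speedup is defined as the relative change `opti / base - 1`; the PR does not state the formula, but this definition matches all five reported values. The helper name `speedup_pct` is hypothetical.

```python
# (M, N, K), baseline TFLOPS (128x160), optimized TFLOPS (256x128),
# copied from the table in this PR.
rows = [
    ((4096, 24576, 1536), 1162, 1204),
    ((4096, 32768, 512), 801, 777),
    ((4096, 7168, 16384), 1451, 1500),
    ((4096, 4096, 7168), 1304, 1377),
    ((4096, 7168, 2048), 1185, 1159),
]


def speedup_pct(base_tflops: float, opti_tflops: float) -> float:
    """Relative throughput change in percent (assumed definition)."""
    return (opti_tflops / base_tflops - 1.0) * 100.0


for (m, n, k), base_tf, opti_tf in rows:
    print(f"{m}x{n}x{k}: {speedup_pct(base_tf, opti_tf):+.2f}%")
```

The two regressions occur on the shapes with the smallest K after the first row (K = 512 and K = 2048), while the largest gains are on the shapes with K >= 7168.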

@sazczmh sazczmh added the perf label Apr 8, 2025
@sazczmh sazczmh self-assigned this Apr 8, 2025
@LyricZhao LyricZhao force-pushed the blocktile-256x128 branch from 1eeb98a to 48a5f07 Compare April 9, 2025 02:01
@LyricZhao LyricZhao requested a review from zheanxu April 9, 2025 03:10
@LyricZhao LyricZhao merged commit fed3e4d into main Apr 9, 2025
@LyricZhao LyricZhao deleted the blocktile-256x128 branch April 11, 2025 03:35

4 participants