@soodoshll
Collaborator

Performance on H100:

      m     n     k   name  latency (ms)      tflops
0  4096  4096  4096  torch      0.176928  776.807225
1  4096  4096  4096  tilus      0.216320  635.350190
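For readers cross-checking the table, the tflops column is consistent with the usual convention of counting 2·m·n·k floating-point operations per matmul and dividing by the measured latency. The sketch below is a minimal illustration under that assumption; `gemm_tflops` is a hypothetical helper name, not a function from the benchmark script.

```python
def gemm_tflops(m: int, n: int, k: int, latency_ms: float) -> float:
    """Achieved TFLOPS for an m x n x k matmul, assuming 2*m*n*k FLOPs."""
    flops = 2 * m * n * k          # one multiply and one add per inner-product term
    seconds = latency_ms * 1e-3    # the table reports latency in milliseconds
    return flops / seconds / 1e12  # FLOP/s -> TFLOP/s


# Latencies copied from the table above.
torch_tflops = gemm_tflops(4096, 4096, 4096, 0.176928)
tilus_tflops = gemm_tflops(4096, 4096, 4096, 0.216320)

print(f"torch: {torch_tflops:.1f} TFLOPS")
print(f"tilus: {tilus_tflops:.1f} TFLOPS")
print(f"tilus / torch: {tilus_tflops / torch_tflops:.1%}")
```

By this measure the tilus kernel reaches roughly 82% of the torch throughput at this problem size.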

Does this match your previous experience, @yaoyaoding?

Signed-off-by: Qidong Su <soodoshll@gmail.com>
@yaoyaoding
Member

Yes, thanks @soodoshll!

The numbers look good to me.

A few further optimizations could help close the gap to cuBLAS performance:

  • warp specialization
  • persistent thread block
  • more efficient write back

CUTLASS is a good source of such optimizations.
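As a rough illustration of the persistent thread block idea listed above: instead of launching one block per output tile, a fixed number of blocks (typically about one per SM) is launched, and each block loops over output tiles with a stride equal to the grid size. The Python sketch below only simulates that tile assignment on the host; the 128x256 tile shape, the 132-block grid, and the name `persistent_tile_schedule` are assumptions for illustration, not what the tilus kernel in this PR uses.

```python
import math


def persistent_tile_schedule(m, n, tile_m, tile_n, num_blocks):
    """Map each persistent block id to the (tile_row, tile_col) tiles it processes."""
    tiles_m = math.ceil(m / tile_m)
    tiles_n = math.ceil(n / tile_n)
    total_tiles = tiles_m * tiles_n
    schedule = {block: [] for block in range(num_blocks)}
    for block in range(num_blocks):
        # Each block walks the linear tile index with a stride of the grid size.
        for tile in range(block, total_tiles, num_blocks):
            schedule[block].append(divmod(tile, tiles_n))
    return schedule


# Assumed configuration: 128x256 output tiles, 132 persistent blocks.
schedule = persistent_tile_schedule(4096, 4096, 128, 256, num_blocks=132)
tiles_per_block = [len(tiles) for tiles in schedule.values()]
print(f"total tiles: {sum(tiles_per_block)}")
print(f"tiles per block: min={min(tiles_per_block)}, max={max(tiles_per_block)}")
```

When the tile count is not a multiple of the grid size, some blocks process one more tile than others. That tail imbalance is one of the things worth checking when choosing tile shapes.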

In the end, we will need to profile with ncu and analyze the generated CUDA/PTX/SASS to squeeze out the last bit of performance.

@yaoyaoding yaoyaoding merged commit 1e27ca3 into NVIDIA:main Dec 7, 2025
8 checks passed
@yaoyaoding yaoyaoding mentioned this pull request Dec 7, 2025
17 tasks