A simple script to test whether TMA and cp.async can saturate the L2 bandwidth.
bash run.sh

The script also plots a heatmap, like this one from an RTX 5090:
I ran into a problem where the write traffic from L1 to L2 on the GPU was much higher than I expected. After some research, local memory became the prime suspect. This is a simple script to verify that the GPU handles local memory with a write-through + write-allocate policy: we see full write traffic but no read traffic.
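As an illustration of the kind of kernel that produces local-memory stores, here is a minimal sketch. It is not the script from this repo: the kernel name `local_mem_writes`, the array size `kSlots`, and the launch dimensions are all made up for the example. It assumes the compiler places a dynamically indexed per-thread array in local memory, which is the usual outcome but is worth confirming from the stack frame size reported by `nvcc -Xptxas -v`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical size of the per-thread scratch array (power of two).
constexpr int kSlots = 64;

// Each thread repeatedly stores into a private array using an index that is
// only known at run time. Such an array normally cannot live in registers,
// so the stores are expected (assumption) to go to local memory.
__global__ void local_mem_writes(const int* idx, int* out, int iters) {
    int scratch[kSlots];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    for (int i = 0; i < kSlots; ++i) scratch[i] = 0;

    int j = idx[tid] & (kSlots - 1);        // runtime index, kept in range
    for (int it = 0; it < iters; ++it) {
        scratch[j] = it + tid;              // store to the per-thread array
        j = (j * 5 + 1) & (kSlots - 1);     // move to another slot
    }

    // One final load keeps the array live so the stores are not removed.
    out[tid] = scratch[idx[tid] & (kSlots - 1)];
}

int main() {
    const int threads = 256, blocks = 1024, n = threads * blocks;
    int *idx = nullptr, *out = nullptr;
    cudaMallocManaged(&idx, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (int i = 0; i < n; ++i) idx[i] = i % kSlots;

    local_mem_writes<<<blocks, threads>>>(idx, out, 1000);
    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s, out[0] = %d\n", cudaGetErrorString(err), out[0]);

    cudaFree(idx);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Profiling a kernel like this with Nsight Compute and comparing the L1-to-L2 write and read traffic should show whether the stores are written through to L2 while the reads stay out of it, which is the behavior described above.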