Countering Load-to-Use Stalls in the NVIDIA Turing GPU


Abstract:

Among its improvements over prior NVIDIA GPUs, the NVIDIA Turing GPU offers four key performance enhancements that counter memory load-to-use stalls. First, reduced latency on L1 hits for global memory loads lowers average memory lookup latency. Next, the ability to dynamically partition the L1 data RAM between cacheable memory and scratchpad (shared) memory lets driver software opportunistically maximize L1 data cache size for programs with low shared memory requirements, increasing L1 hits and reducing load-to-use stalls. Finally, the twin enhancements of doubling the vector register file capacity and adding a dedicated scalar (uniform) register file with a uniform datapath ease vector register pressure and enable higher warp-level parallelism, leading to better latency tolerance. We find that these enhancements combined deliver an average speedup of 11% on modern gaming applications.
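
The reasoning in the abstract can be made concrete with a small back-of-the-envelope model. The sketch below is illustrative only and is not taken from the paper: the function names and every numeric value (latencies in cycles, hit rates, register file sizes, registers per thread) are hypothetical placeholders chosen to show the direction of each effect, not measurements of any real GPU.

```python
def avg_load_latency(l1_hit_rate, l1_hit_cycles, l1_miss_cycles):
    """Average load latency as a hit-rate-weighted mix of L1 hit and miss latency."""
    return l1_hit_rate * l1_hit_cycles + (1.0 - l1_hit_rate) * l1_miss_cycles


def register_limited_warps(regfile_regs, regs_per_thread, threads_per_warp=32):
    """Number of warps whose vector registers fit in the register file."""
    return regfile_regs // (regs_per_thread * threads_per_warp)


if __name__ == "__main__":
    # All values below are hypothetical and exist only to illustrate the trends.
    baseline = avg_load_latency(l1_hit_rate=0.50, l1_hit_cycles=30, l1_miss_cycles=200)
    # A faster L1 hit and a larger L1 (higher hit rate) both lower the average.
    improved = avg_load_latency(l1_hit_rate=0.75, l1_hit_cycles=20, l1_miss_cycles=200)
    print("average load latency (cycles):", baseline, "->", improved)

    # Doubling register file capacity doubles the register-limited warp count.
    warps_small = register_limited_warps(regfile_regs=16384, regs_per_thread=64)
    warps_double = register_limited_warps(regfile_regs=32768, regs_per_thread=64)
    # Assume (hypothetically) that 8 warp-invariant values per thread move to a
    # separate uniform register file, so each thread needs fewer vector registers.
    warps_uniform = register_limited_warps(regfile_regs=32768, regs_per_thread=56)
    print("register-limited warps:", warps_small, warps_double, warps_uniform)
```

More resident warps give the scheduler more independent work to issue while a load is outstanding, which is the latency-tolerance effect the abstract describes; the model ignores every other occupancy limit.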
Published in: IEEE Micro (Volume: 40, Issue: 6, 01 Nov.-Dec. 2020)
Page(s): 59 - 66
Date of Publication: 28 July 2020
