DUCATI: High-performance Address Translation by Extending TLB Reach of GPU-accelerated Systems

Published: 08 March 2019

Abstract

Conventional on-chip TLB hierarchies are unable to fully cover growing application working-set sizes. To make things worse, Last-Level TLB (LLT) misses require multiple accesses to the page table even with the use of page walk caches. Consequently, LLT misses incur long address translation latency and hurt performance. This article proposes two low-overhead hardware mechanisms for reducing the frequency and penalty of on-die LLT misses. The first, Unified CAche and TLB (UCAT), enables the conventional on-die Last-Level Cache to store cache lines and TLB entries in a single unified structure and increases on-die TLB capacity significantly. The second, DRAM-TLB, memoizes virtual-to-physical address translations in DRAM and reduces the LLT miss penalty when UCAT is unable to fully cover the total application working set. DRAM-TLB serves as the next larger level in the TLB hierarchy and significantly increases TLB coverage relative to on-chip TLBs. The combination of these two mechanisms, DUCATI, is an address translation architecture that improves GPU performance by 81% (up to 4.5×) while requiring minimal changes to the existing system design. We show that DUCATI is within 20%, 5%, and 2% of the performance of a perfect LLT system when using 4KB, 64KB, and 2MB pages, respectively.
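
To make the notion of TLB coverage concrete, the following is a minimal sketch of the standard "TLB reach" arithmetic (number of entries multiplied by page size) for the three page sizes named in the abstract. The entry count of 1024 is a hypothetical value chosen only for illustration; it is not taken from the article, and the function names are likewise assumptions.

```python
# Sketch of TLB reach: reach = number of entries * page size.
# HYPOTHETICAL_LLT_ENTRIES is an illustrative assumption, not a figure
# reported in the article.

KB = 1024
MB = 1024 * KB

# Page sizes mentioned in the abstract.
PAGE_SIZES = {"4KB": 4 * KB, "64KB": 64 * KB, "2MB": 2 * MB}

HYPOTHETICAL_LLT_ENTRIES = 1024  # assumption for illustration only


def tlb_reach_bytes(num_entries: int, page_size_bytes: int) -> int:
    """Return how many bytes of address space the TLB can map at once."""
    if num_entries < 0 or page_size_bytes <= 0:
        raise ValueError("entries must be >= 0 and page size must be > 0")
    return num_entries * page_size_bytes


def reach_table(num_entries: int) -> dict:
    """Map each page-size label to the reach (in bytes) for num_entries."""
    return {
        label: tlb_reach_bytes(num_entries, size)
        for label, size in PAGE_SIZES.items()
    }


if __name__ == "__main__":
    for label, reach in reach_table(HYPOTHETICAL_LLT_ENTRIES).items():
        print(f"{label} pages: reach = {reach // MB} MB")
```

Under this assumed entry count, reach scales linearly with page size, so a working set larger than the computed reach cannot be fully covered by that TLB alone. This is the gap the abstract describes UCAT and DRAM-TLB as addressing by adding capacity beyond the on-chip TLBs.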

Information

Published In

ACM Transactions on Architecture and Code Optimization  Volume 16, Issue 1
March 2019
157 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/3313806
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 08 March 2019
Accepted: 01 January 2019
Revised: 01 January 2019
Received: 01 February 2018
Published in TACO Volume 16, Issue 1

Author Tags

  1. GPU
  2. TLB
  3. caches
  4. high bandwidth memory
  5. virtual memory

Qualifiers

  • Research-article
  • Research
  • Refereed

Article Metrics

  • Downloads (Last 12 months): 730
  • Downloads (Last 6 weeks): 80
Reflects downloads up to 25 Jan 2025

Cited By

  • (2024) VPRI: Efficient I/O Page Fault Handling via Software-Hardware Co-Design for IaaS Clouds. Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, 541-557. DOI: 10.1145/3694715.3695957. Online publication date: 4-Nov-2024
  • (2024) Rethinking Page Table Structure for Fast Address Translation in GPUs: A Fixed-Size Hashed Page Table. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, 325-337. DOI: 10.1145/3656019.3676900. Online publication date: 14-Oct-2024
  • (2024) Low-Overhead General-Purpose Near-Data Processing in CXL Memory Expanders. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 594-611. DOI: 10.1109/MICRO61859.2024.00051. Online publication date: 2-Nov-2024
  • (2024) SUV: Static Analysis Guided Unified Virtual Memory. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 293-308. DOI: 10.1109/MICRO61859.2024.00030. Online publication date: 2-Nov-2024
  • (2024) A Case for Speculative Address Translation with Rapid Validation for GPUs. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), 278-292. DOI: 10.1109/MICRO61859.2024.00029. Online publication date: 2-Nov-2024
  • (2024) Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs. 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), 834-847. DOI: 10.1109/ISCA59077.2024.00065. Online publication date: 29-Jun-2024
  • (2024) GPU Scale-Model Simulation. 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1125-1140. DOI: 10.1109/HPCA57654.2024.00088. Online publication date: 2-Mar-2024
  • (2023) GPU Performance Acceleration via Intra-Group Sharing TLB. Proceedings of the 52nd International Conference on Parallel Processing, 705-714. DOI: 10.1145/3605573.3605593. Online publication date: 7-Aug-2023
  • (2023) SnakeByte: A TLB Design with Adaptive and Recursive Page Merging in GPUs. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 1195-1207. DOI: 10.1109/HPCA56546.2023.10071063. Online publication date: Feb-2023
  • (2023) Trans-FW: Short Circuiting Page Table Walk in Multi-GPU Systems via Remote Forwarding. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 456-470. DOI: 10.1109/HPCA56546.2023.10071054. Online publication date: Feb-2023