skip to main content
10.1145/3470496.3533045acmconferencesArticle/Chapter ViewAbstractPublication PagesiscaConference Proceedingsconference-collections
research-article
Open access

Mixed-proxy extensions for the NVIDIA PTX memory consistency model: industrial product

Published: 11 June 2022 Publication History

Abstract

In recent years, there has been a trend towards the use of accelerators and architectural specialization to continue scaling performance in spite of a slowing of Moore's Law. GPUs have always relied on dedicated hardware for graphics workloads, but modern GPUs now also incorporate compute-domain accelerators such as NVIDIA's Tensor Cores for machine learning. For these accelerators to be successfully integrated into a general-purpose programming language such as C++ or CUDA, there must be a forward- and backward-compatible API for the functionality they provide. To the extent that all of these accelerators interact with program threads through memory, they should be incorporated into the GPU's memory consistency model. Unfortunately, the use of accelerators and/or special non-coherent paths into memory produces non-standard memory behavior that existing GPU memory models cannot capture.
In this work, we describe the "proxy" extensions added to version 7.5 of NVIDIA's PTX ISA for GPUs. A proxy is an extra tag abstractly applied to every memory or fence operation. Proxies generalize the notion of address translation and specialized non-coherent cache hierarchies into an abstraction that cleanly describes the resulting non-standard behavior. The goal of proxies is to facilitate integration of these specialized memory accesses into the general-purpose PTX programming model in a fully composable manner. This paper characterizes the behaviors that proxies can capture, the microarchitectural intuition behind them, the necessary updates to the formal memory model, and the tooling that we built in order to ensure that the resulting model both is sound and meets the needs of business-critical workloads that they are designed to support.

References

[1]
Parosh Aziz Abdulla, Mohamed Faouzi Atig, Yu-Fang Chen, Carl Leonardsson, and Ahmed Rezine. 2012. Counter-example Guided Fence Insertion Under TSO. In Proceedings of the Eighteenth International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS).
[2]
Jade Alglave, Mark Batty, Alistair F. Donaldson, Ganesh Gopalakrishnan, Jeroen Ketema, Daniel Poetzl, Tyler Sorensen, and John Wickerson. 2015. GPU Concurrency: Weak Behaviours and Programming Assumptions. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[3]
Jade Alglave, Anthony Fox, Samin Ishtiaq, Magnus O. Myreen, Susmit Sarkar, Peter Sewell, and Francesco Zappa Nardelli. 2008. The Semantics of Power and ARM Multiprocessor Machine Code. In Proceedings of the Fourth Workshop on Declarative Aspects of Multicore Programming (DAMP).
[4]
Jade Alglave, Daniel Kroening, Vincent Nimal, and Daniel Poetzl. 2017. Don't Sit on the Fence: A Static Analysis Approach to Automatic Fence Insertion. In ACM Transactions on Programming Languages and Systems (TOPLAS), Vol. 39. Issue 2.
[5]
Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. Herding Cats: Modelling, Simulation, Testing, and Data-mining for Weak Memory. In Proceedings of the Thirty-Fifth ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[6]
ARM. 2021. Arm Architecture Reference Manual Armv8, for A-Profile Architecture. https://developer.arm.com/documentation/ddi0487/gb
[7]
Mark Batty, Kayvan Memarian, Kyndylan Nienhuis, Jean Pichon-Pharabod, and Peter Sewell. 2015. The Problem of Programming Language Concurrency Semantics. In Proceedings of the Twenth-Fourth European Symposium on Programming Languages and Systems (ESOP).
[8]
Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. 2011. Mathematizing C++ Concurrency. In Proceedings of the Thirty-Eighth ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL).
[9]
Hans-J. Boehm. 2013. Prohibiting "Out of Thin Air" Results in C++14. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3786.htm.
[10]
Hans-J. Boehm and Sarita V. Adve. 2008. Foundations of the C++ Concurrency Memory Model. In Proceedings of the Twenty-Ninth International Conference on Programming Language Design and Implementation (PLDI).
[11]
Hans-J. Boehm, Mark Batty, Olivier Giroux, Paul McKenney, Peter Sewell, and Francesco Zappa Nardelli. 2013. Specifying the Absence of "Out of Thin Air" Results. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html.
[12]
Hans-J. Boehm and Brian Demsky. 2014. Outlawing Ghosts: Avoiding Out-ofthin-air Results. In Proceedings of the Workshop on Memory Systems Performance and Correctness (MSPC).
[13]
James Bornholt and Emina Torlak. 2017. Synthesizing Memory Models from Framework Sketches and Litmus Tests. In Proceedings of the Thirty-Eighth ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[14]
Xing Fang, Jaejin Lee, and Samuel P Midkiff. 2003. Automatic Fence Insertion for Shared Memory Multiprocessing. In Proceedings of the Seventeenth Annual International Conference on Supercomputing (SC).
[15]
Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, Luc Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter Sewell. 2017. Mixed-size Concurrency: ARM, POWER, C/C++11, and SC. In Proceedings of the Forty-Fourth ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).
[16]
HSA Foundation. 2015. HSA Platform System Architecture Specification. http://www.hsafoundation.com/html/Content/PRM/Topics/06_Memory/memory_model.htm.
[17]
RISC-V Foundation. 2022. The RISC-V Instruction Set Manual, Volume I: Unprivileged ISA. https://riscv.org.
[18]
Matt Godbolt. 2022. Compiler Explorer. https://godbolt.org/
[19]
Derek R Hower, Blake A Hechtman, Bradford M Beckmann, Benedict R Gaster, Mark D Hill, Steven K Reinhardt, and David A Wood. 2014. Heterogeneous-Race-Free Memory Models. In Proceedings of the Nineteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[20]
Daniel Jackson. 2002. Alloy: A Lightweight Object Modelling Notation. ACM Transactions on Software Engineering and Methodology 11, 2, 256--290.
[21]
Alan Jeffrey and James Riely. 2016. On Thin Air Reads Towards an Event Structures Model of Relaxed Memory. In Proceedings of the 31st Annual ACM/IEEE Symposium on Logic in Computer Science (LICS).
[22]
Alan Jeffrey, James Riely, Mark Batty, Simon Cooksey, Ilya Kaysin, and Anton Podkopaev. 2022. The Leaky Semicolon: Compositional Semantic Dependencies for Relaxed-Memory Concurrency. In Proceedings of the 49th ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).
[23]
Jeehoon Kang, Chung-Kil Hur, Ori Lahav, Viktor Vafeiadis, and Derek Dreyer. 2017. A Promising Semantics for Relaxed-memory Concurrency. In Proceedings of the Forty-Fourth ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).
[24]
Khronos Vulkan Working Group. 2018. Vulkan 1.1.92 - A Specification. https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html.
[25]
Michael Kuperstein, Martin Vechev, and Eran Yahav. 2012. Automatic Inference of Memory Fences. In Proceedings of the Tenth Conference on Formal Methods in Computer Aided Design (FMCAD).
[26]
Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek Dreyer. 2017. Repairing Sequential Consistency in C/C++11. In Proceedings of the Thirty-Eighth ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[27]
Daniel Lustig, Sameer Sahasrabuddhe, and Olivier Giroux. 2019. A Formal Analysis of the NVIDIA PTX Memory Consistency Model. In Proceedings of the Twenty-Third ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[28]
Daniel Lustig, Andrew Wright, Alexandras Papakonstantinou, and Olivier Giroux. 2017. Automated Synthesis of Comprehensive Memory Model Litmus Test Suites. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[29]
Sela Mador-Haim, Luc Maranget, Susmit Sarkar, Kayvan Memarian, Jade Alglave, Scott Owens, Rajeev Alur, Milo M. K. Martin, Peter Sewell, and Derek Williams. 2012. An Axiomatic Memory Model for POWER Multiprocessors. In Proceedings of the Twenty-Fourth International Conference on Computer Aided Verification (CAV).
[30]
Yatin A. Manerkar, Caroline Trippel, Daniel Lustig, Michael Pellauer, and Margaret Martonosi. 2016. Counterexamples and Proof Loophole for the C/C++ to POWER and ARMv7 Trailing-Sync Compiler Mappings. In ArXiv, Vol. abs/1611.01507.
[31]
Jeremy Manson, William Pugh, and Sarita V. Adve. 2005. The Java Memory Model. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL).
[32]
Luc Maranget and Jade Alglave. 2015. Towards a Formalization of the HSA Memory Model in the cat Language. https://hal.inria.fr/hal-01413251
[33]
Paul E. McKenny, Alan Jeffrey, Ali Sezgin, and Tony Tye. 2016. P0422: Out-of-Thin-Air Execution is Vacuous. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0422r0.html.
[34]
MLCommons. 2021. General MLPerf Submission Rules v0.3. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[35]
NVIDIA. 2017. NVIDIA Tesla V100 GPU Architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
[36]
NVIDIA. 2021. CUDA C++ Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[37]
NVIDIA. 2021. PTX ISA :: CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html
[38]
Marc S. Orr, Shuai Che, Ayse Yilmazer, Bradford M. Beckmann, Mark D. Hill, and David A. Wood. 2015. Synchronization Using Remote-Scope Promotion. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS).
[39]
Marco Pavoitti, Simon Cooksey, Anouk Paradis, Daniel Wright, Scott Owens, and Mark Batty. 2020. Modular Relaxed Dependencies in Weak Memory Concurrency. In Proceedings of the Twenty-Ninth European Symposium on Programming (ESOP).
[40]
Jean Pichon-Pharabod and Peter Sewell. 2016. A Concurrency Semantics for Relaxed Atomics That Permits Optimisation and Avoids Thin-air Executions. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL).
[41]
Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar, and Peter Sewell. 2018. Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8. In Proceedings of the Forty-Fifth ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).
[42]
Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. 2010. X86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors. In Communications of the ACM, Vol. 53. Issue 7.
[43]
Ben Simner, Alasdair Armstrong, Jean Pichon-Pharabod, Christopher Pulte, Richard Grisenthwaite, and Peter Sewell. 2022. Relaxed virtual memory in Armv8-A. In Proceedings of the Thirty-First European Symposium on Programming (ESOP).
[44]
Ben Simner, Shaked Flur, Christopher Pulte, Alasdair Armstrong, Jean Pichon-Pharabod, Luc Maranget, and Peter Sewell. 2020. ARMv8-A System Semantics: Instruction Fetch in Relaxed Architectures. In Proceedings of the Twenty-Ninth European Symposium on Programming (ESOP).
[45]
Matthew D Sinclair, Johnathan Alsop, and Sarita V Adve. 2015. Efficient GPU Synchronization Without Scopes: Saying No to Complex Consistency Models. In Proceedings of the Forty-Eighth International Symposium on Microarchitecture (MICRO).
[46]
Matthew D. Sinclair, Johnathan Alsop, and Sarita V. Adve. 2017. Chasing Away RAts: Semantics and Evaluation for Relaxed Atomics on Heterogeneous Systems. In Proceedings of the Forty-Fourth Annual International Symposium on Computer Architecture (ISCA).
[47]
Tyler Sorensen and Alastair F. Donaldson. 2016. Exposing Errors Related to Weak Memory in GPU Applications. In Proceedings of the Thirty-Seventh ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI).
[48]
John Wickerson, Mark Batty, Tyler Sorensen, and George A. Constantinides. 2017. Automatically Comparing Memory Consistency Models. In Proceedings of the Forty-Fourth ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).
[49]
John Wickerson, Mark Batty, Tyler Sorensen, and George A Constantinides. 2017. Automatically Comparing Memory Consistency Models. In Proceedings of the Forty-Fourth ACM SIGPLAN Symposium on Principles of Programming Languages (POPL).

Cited By

View all
  • (2024)Formalisation of a New Weak Semantics for AuDaLaAutomated Technology for Verification and Analysis10.1007/978-3-031-78750-8_5(93-116)Online publication date: 21-Oct-2024

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
June 2022
1097 pages
ISBN:9781450386104
DOI:10.1145/3470496
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

  • IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 June 2022

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPU
  2. memory consistency
  3. memory ordering
  4. synchronization

Qualifiers

  • Research-article

Conference

ISCA '22
Sponsor:

Acceptance Rates

ISCA '22 Paper Acceptance Rate 67 of 400 submissions, 17%;
Overall Acceptance Rate 543 of 3,203 submissions, 17%

Upcoming Conference

ISCA '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)585
  • Downloads (Last 6 weeks)79
Reflects downloads up to 20 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Formalisation of a New Weak Semantics for AuDaLaAutomated Technology for Verification and Analysis10.1007/978-3-031-78750-8_5(93-116)Online publication date: 21-Oct-2024

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media