
Transient Fault Detection in Tensor Cores for Modern GPUs

Published: 28 August 2024

Abstract

Deep neural networks (DNNs) have emerged as an effective solution for many machine learning applications, but their success comes at the cost of excessive computation. NVIDIA's Volta graphics processing unit (GPU) introduced a specialized hardware unit called the tensor core (TC) to meet the growing computational demand of DNNs. Most previous studies of TCs have focused on improving performance by exploiting the TC's high degree of parallelism. However, as DNNs are deployed in safety-critical applications such as autonomous driving, the reliability of TCs becomes as important as their performance.
In this work, we exploit the unique architectural characteristics of TCs and propose a simple, implementation-efficient hardware technique called fault detection in tensor core (FDTC) to detect transient faults in TCs. FDTC exploits the zero-valued weights that stem from network pruning, as well as the sparse activations produced by the common ReLU operator, to verify tensor operations. The high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, incurring no performance penalty. For applications with low sparsity, FDTC falls back on temporal redundancy, re-executing effectual products and scheduling the verifying products only in cycles when the multipliers are idle. Our experiments show that FDTC offers 100% fault coverage in TCs with no performance penalty and only a small energy overhead.
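To make the mechanism concrete, here is a minimal Python sketch of the duplicate-and-compare idea on a 4-wide dot-product unit, the building block of a TC. It is an illustration under stated assumptions, not the paper's implementation: the names fdtc_dot_product and inject_fault are hypothetical, and integer operands stand in for the TC's half-precision inputs.

```python
# Hypothetical sketch of FDTC-style duplicate-and-compare for a 4-element
# dot-product unit. Effectual products (both operands nonzero) are verified
# spatially on multiplier lanes left idle by zero operands; any remainder is
# verified temporally in an extra cycle.

def fdtc_dot_product(a, b, num_lanes=4, inject_fault=None):
    """Compute sum(a[i] * b[i]) while verifying every effectual product.

    Returns (result, fault_detected, cycles).
    """
    assert len(a) == len(b) == num_lanes
    effectual = [i for i in range(num_lanes) if a[i] != 0 and b[i] != 0]
    idle = num_lanes - len(effectual)      # lanes whose product is trivially zero

    products = {i: a[i] * b[i] for i in effectual}
    if inject_fault in products:           # model a transient fault in one lane
        products[inject_fault] += 1

    spatial = effectual[:idle]             # verified in the same cycle, for free
    temporal = effectual[idle:]            # re-executed when lanes become idle
    cycles = 1 + (1 if temporal else 0)

    fault = any(a[i] * b[i] != products[i] for i in spatial + temporal)
    return sum(products.values()), fault, cycles


# With 50% sparsity, all checks fit in idle lanes: zero performance penalty.
a, b = [3, 0, 5, 0], [2, 7, 4, 1]
print(fdtc_dot_product(a, b))                  # (26, False, 1)
print(fdtc_dot_product(a, b, inject_fault=0))  # corrupted lane 0 is detected
```

In this toy model, sparsity of 50% or more gives every effectual product an idle partner lane, so verification adds no cycles; below that threshold the sketch spends one extra cycle, mirroring FDTC's temporal scheduling of verifying products on idle multipliers.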



Published In

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 5 (September 2024), 549 pages
EISSN: 1558-3465
DOI: 10.1145/3613632
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States


Publication History

Received: 27 September 2023
Revised: 09 April 2024
Accepted: 02 August 2024
Online AM: 10 August 2024
Published: 28 August 2024
Published in TECS Volume 23, Issue 5


Author Tags

1. Deep neural networks
2. graphics processing unit
3. tensor core
4. reliability

Qualifiers

• Research-article
