
Transient Fault Detection in Tensor Cores for Modern GPUs

Published: 28 August 2024

Abstract

Deep neural networks (DNNs) have emerged as an effective solution for many machine learning applications, but their success comes at the cost of excessive computation. NVIDIA's Volta graphics processing unit (GPU) introduced a specialized hardware unit called the tensor core (TC) to meet the growing computational demand of DNNs. Most previous studies of TCs have focused on improving performance by exploiting the TC's high degree of parallelism. However, as DNNs are deployed in safety-critical applications such as autonomous driving, the reliability of TCs becomes as important as their performance.
In this work, we exploit the unique architectural characteristics of TCs and propose a simple, implementation-efficient hardware technique called fault detection in tensor core (FDTC) to detect transient faults in TCs. FDTC exploits the zero-valued weights that stem from network pruning, as well as the sparse activations produced by the common ReLU operator, to verify tensor operations. The high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, incurring no performance penalty. For applications with low sparsity, FDTC falls back on temporal redundancy, re-executing effectual products and scheduling the verifying products only in cycles when the multipliers are idle. Our experiments show that FDTC offers 100% fault coverage in TCs with no performance penalty and only a small energy overhead.
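To make the mechanism concrete, here is a minimal Python sketch of the duplicate-and-compare idea on a 4-wide dot-product unit, the building block of a TC. It is an illustration under stated assumptions, not the paper's implementation: the names fdtc_dot_product and inject_fault are hypothetical, and integer operands stand in for the TC's half-precision inputs.

```python
# Hypothetical sketch of FDTC-style duplicate-and-compare for a 4-element
# dot-product unit. Effectual products (both operands nonzero) are verified
# spatially on multiplier lanes left idle by zero operands; any remainder is
# verified temporally in an extra cycle.

def fdtc_dot_product(a, b, num_lanes=4, inject_fault=None):
    """Compute sum(a[i] * b[i]) while verifying every effectual product.

    Returns (result, fault_detected, cycles).
    """
    assert len(a) == len(b) == num_lanes
    effectual = [i for i in range(num_lanes) if a[i] != 0 and b[i] != 0]
    idle = num_lanes - len(effectual)      # lanes whose product is trivially zero

    products = {i: a[i] * b[i] for i in effectual}
    if inject_fault in products:           # model a transient fault in one lane
        products[inject_fault] += 1

    spatial = effectual[:idle]             # verified in the same cycle, for free
    temporal = effectual[idle:]            # re-executed when lanes become idle
    cycles = 1 + (1 if temporal else 0)

    fault = any(a[i] * b[i] != products[i] for i in spatial + temporal)
    return sum(products.values()), fault, cycles


# With 50% sparsity, all checks fit in idle lanes: zero performance penalty.
a, b = [3, 0, 5, 0], [2, 7, 4, 1]
print(fdtc_dot_product(a, b))                  # (26, False, 1)
print(fdtc_dot_product(a, b, inject_fault=0))  # corrupted lane 0 is detected
```

In this toy model, sparsity of 50% or more gives every effectual product an idle partner lane, so verification adds no cycles; below that threshold the sketch spends one extra cycle, mirroring FDTC's temporal scheduling of verifying products on idle multipliers.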



Published In

ACM Transactions on Embedded Computing Systems, Volume 23, Issue 5 (September 2024), 549 pages
EISSN: 1558-3465
DOI: 10.1145/3613632
Editor: Tulika Mitra

Publisher

Association for Computing Machinery, New York, NY, United States


Publication History

Received: 27 September 2023
Revised: 09 April 2024
Accepted: 02 August 2024
Online AM: 10 August 2024
Published: 28 August 2024
Published in TECS Volume 23, Issue 5


Author Tags

1. Deep neural networks
2. graphics processing unit
3. tensor core
4. reliability

Qualifiers

• Research-article
