Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems
Abstract
1 Introduction
2 Related Work
2.1 SpMV Optimization on Single GPU
2.2 SpMV Optimization on Multi-GPU Systems
3 Methodology
3.1 Non-zeros-based Matrix Partitioning
3.2 Two-level Non-zeros-based Matrix Partitioning
3.3 Long-row-aware Sparse Matrix Partitioning
3.4 Redundant-computing Optimization
4 Experimental Evaluation
4.1 Setup
Platform 1 | Platform 2 | ||
---|---|---|---|
Hardware | Host | CPU: Intel(R) Core(TM) i9-10920X, 3.5 GHz, 12 cores Memory: 192 GB | CPU: Intel(R) Xeon(R) Gold 5218 v5, 2.3 GHz, 16 cores Memory: 192 GB |
Device | GPU: GeForce RTX 3090, 1.70 GHz, 10,496 cores Memory: 24 GB | GPU: Tesla V100-SXM2, 1.53 GHz, 5,120 cores Memory: 32 GB | |
OS | 64-bit Ubuntu 18.04 | 64-bit CentOS 7.9.2009 | |
Compiler | nvcc:11.1.74; gcc/g++: 9.2.0 | nvcc:11.6.55; gcc/g++: 7.5.0 | |
Software | Library | cuSPARSE: 11.1 | cuSPARSE: 11.6 |
ID | Matrix | Rows | NNZ | Ave. | Min. | Max. |
---|---|---|---|---|---|---|
1 | kmer_U1a | 67,716,231 | 138,778,562 | 2 | 1 | 35 |
2 | kmer_A2a | 170,728,175 | 360,585,172 | 2 | 1 | 40 |
3 | kmer_V2a | 55,042,369 | 117,217,600 | 2 | 1 | 39 |
4 | kmer_P1a | 139,353,211 | 297,829,984 | 2 | 1 | 40 |
5 | kmer_V1r | 214,005,017 | 465,410,904 | 2 | 1 | 8 |
6 | webbase-2001 | 118,142,155 | 1,019,903,190 | 9 | 0 | 3,841 |
7 | FullChip | 2,987,012 | 26,621,983 | 9 | 1 | 2,312,481 |
8 | ljournal-2008 | 5,363,260 | 79,023,142 | 15 | 0 | 2,469 |
9 | rgg_n_2_23_s0 | 8,388,608 | 127,002,786 | 15 | 0 | 40 |
10 | rgg_n_2_24_s0 | 16,777,216 | 265,114,400 | 16 | 0 | 40 |
11 | uk-2002 | 18,520,486 | 298,113,762 | 16 | 0 | 2,450 |
12 | com-LiveJournal | 3,997,962 | 69,362,378 | 17 | 1 | 14,815 |
13 | uk-2005 | 39,459,925 | 936,364,282 | 24 | 0 | 5,213 |
14 | GAP-twitter | 61,578,415 | 1,468,364,884 | 24 | 0 | 2,997,469 |
15 | indochina-2004 | 7,414,866 | 194,109,311 | 26 | 0 | 6,985 |
16 | nlpkkt160 | 8,345,600 | 225,422,112 | 28 | 5 | 28 |
17 | nlpkkt200 | 16,240,000 | 440,225,632 | 28 | 5 | 28 |
18 | nlpkkt240 | 27,993,600 | 760,648,352 | 28 | 5 | 28 |
19 | it-2004 | 41,291,594 | 1,150,725,436 | 28 | 0 | 9,964 |
20 | arabic-2005 | 22,744,080 | 639,999,458 | 28 | 0 | 9,905 |
21 | stokes | 11,449,533 | 349,321,980 | 31 | 1 | 720 |
22 | twitter7 | 41,652,230 | 1,468,365,182 | 35 | 0 | 2,997,469 |
23 | GAP-web | 50,636,151 | 1,930,292,948 | 38 | 0 | 12,869 |
24 | sk-2005 | 50,636,154 | 1,949,412,601 | 38 | 0 | 12,870 |
4.2 Experimental Results
4.2.1 Parameter Selection.
Platform | aveNNZ<8 | \(aveNNZ\ge 8\) | ||||
---|---|---|---|---|---|---|
LRA | LRA+RC | LRA | LRA+RC | |||
\(d_l\) | \(d_l\) | \(d_c\) | \(d_l\) | \(d_l\) | \(d_c\) | |
P1-2GPU | 0.50 | 0.40 | 0.25 | 0.35 | 0.35 | 0.00 |
P2-2GPU | 0.50 | 0.40 | 0.15 | 0.30 | 0.25 | 0.05 |
P2-4GPU | 0.50 | 0.35 | 0.20 | 0.35 | 0.25 | 0.05 |
4.2.2 Performance Comparison.
4.2.3 Application Evaluation.
4.2.4 Preprocessing Overhead.
Metric | P1-2GPU | P2-2GPU | P2-4GPU |
---|---|---|---|
Minimum | 0.01 | 0.01 | 0.01 |
Arithmetic mean | 0.32 | 0.18 | 0.20 |
Geometric mean | 0.21 | 0.11 | 0.12 |
Maximum | 0.83 | 0.57 | 0.87 |
5 Conclusion
Acknowledgments
Footnotes
References
Index Terms
- Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems
Recommendations
Adaptive Sparse Matrix-Vector Multiplication on CPU-GPU Heterogeneous Architecture
HPCCT '19: Proceedings of the 2019 3rd High Performance Computing and Cluster Technologies ConferenceSpMV is the core algorithm in solving the sparse linear equations, which is widely used in many research and engineering application field. GPU is the most common coprocessor in high-performance computing domain, and has already been proven to ...
CUDA-enabled Sparse Matrix-Vector Multiplication on GPUs using atomic operations
We propose the Sliced Coordinate Format (SCOO) for Sparse Matrix-Vector Multiplication on GPUs.An associated CUDA implementation which takes advantage of atomic operations is presented.We propose partitioning methods to transform a given sparse matrix ...
Performance optimization of Sparse Matrix-Vector Multiplication for multi-component PDE-based applications using GPUs
Simulations of many multi-component PDE-based applications, such as petroleum reservoirs or reacting flows, are dominated by the solution, on each time step and within each Newton step, of large sparse linear systems. The standard solver is a ...
Comments
Information & Contributors
Information
Published In
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
Check for updates
Author Tags
Qualifiers
- Research-article
Funding Sources
- Postdoctoral Fellowship Program of China Postdoctoral Science Foundation
- National Natural Science Foundation of China
Contributors
Other Metrics
Bibliometrics & Citations
Bibliometrics
Article Metrics
- 0Total Citations
- 1,027Total Downloads
- Downloads (Last 12 months)1,027
- Downloads (Last 6 weeks)340
Other Metrics
Citations
View Options
Login options
Check if you have access through your login credentials or your institution to get full access on this article.
Sign in