1 Introduction
The multiplication of a sparse matrix and a vector (SpMV) has been one of the key computations in high-performance computing (HPC), including recommendation systems [40], linear programming [42], graph analytics [6], and big-data analytics [7]. In addition to such HPC applications, sparse neural network computation employs SpMV as the basic building block in neural network accelerators [14, 16, 51]. Recent transformer accelerators also use matrix-vector multiplication as the primary computation engine [26].
Due to the importance of the SpMV operation, several prior accelerator studies have leveraged the unique value properties of sparse data and explored custom hardware acceleration [16, 17, 51]. While these accelerators have their own unique designs, there are two major architectural design choices: the algorithm and the matrix format. Although software kernels for CPUs/GPUs have considered variants of such algorithms and formats, the advent of accelerators and the widening range of workloads driven by machine learning have made certain design options that were not commonly used in CPU/GPU computation attractive for custom accelerators.
The first aspect is the matrix-vector multiplication algorithm, which has two possible approaches: dot product and scalar multiplication. Dot product has been the dominant algorithm for software kernels, but accelerators have made scalar multiplication a viable option with their customized dataflows [1, 5, 17, 25, 41]. The second aspect is the sparse matrix representation, which has two representative approaches: compressed sparse formats (e.g., CSR and CSC) and the bitmap format. In traditional HPC workloads with high sparsity, compressed formats are dominant, but sparse ML computation with relatively low sparsity can opt for bitmap formats due to their compact representation. Combining these two design aspects yields four possible designs.
However, each existing SpMV accelerator has belonged exclusively to one of the four design classes. It has yet to be investigated whether one design choice is superior to the others across different workloads and hardware configurations, and why a particular design choice should be selected for a given workload. Therefore, in addition to optimizing a fixed algorithm and data representation, it is necessary to investigate the possible design options more thoroughly.
Using various matrix dimensions and sparsities from HPC workloads and sparse neural network models, this study shows that a single fixed design does not always produce the best performance across different architectural configurations, matrices, and input vectors. The dimensions and densities of matrices and vectors affect the efficiency of different sparsity representations, detection mechanisms, and algorithms. Our design space exploration reveals two complementary design options (i.e., compressed sparse format and bitmap format, both with dot product): one of the two options provides good performance across a range of input workloads, while neither is superior to the other. In addition, a single accelerator needs to support both sparse and dense data effectively, since it is more efficient to use basic dense computation when the input matrix density is relatively high.
To maximize the efficiency of SpMV acceleration under a wide variety of matrix and vector characteristics, this paper proposes a novel triple-mode SpMV accelerator called Cerberus, which can switch modes to use the best data representation. This paper shows that a single hardware substrate can effectively support dot product-based SpMV computation with two complementary SpMV configurations (i.e., compressed sparse format and bitmap format for sparse matrix representation) as well as dense MV computation. Each processing element (PE) contains an indexing mechanism that supports not only the two sparse computation modes, but also the dense mode. The format of an input matrix is determined based on its characteristics. While the accelerator supports multi-mode operation, to avoid duplicating inputs in different formats, the input matrix is stored in memory using only one of the three formats (i.e., compressed sparse, bitmap, and dense).
To support such a multi-mode accelerator, the mode decision process for a given workload is a key technique. This study proposes a selection algorithm that finds the best mode based on the characteristics of the input matrix without actually running the workload on an accelerator. The selection algorithm uses only simple hyperparameters of the input matrix, such as its dimensions and density, and suggests which mode is best.
Instead of a multi-mode accelerator, it is possible to use a different accelerator for each workload. Such an approach can be feasible with FPGA accelerators. However, ASIC accelerators with better performance must process a wide variety of workloads with a fixed hardware substrate. Our triple-mode support can adaptively handle such various workloads efficiently.
We evaluate performance with architectural simulation and implement the hardware in Chisel for behavioral validation and synthesis. We plan to open-source the hardware design after publication. The experimental results from our SpMV accelerator simulation demonstrate that the multi-mode architecture provides on average a 12.1× performance boost over dense-only computation, and a 1.5× boost over the best fixed SpMV design across the benchmark workloads and hardware configurations. The prediction model finds the best configuration with 83.3% accuracy, reaching 90% of the performance of the oracle model.
Although there have been studies on choosing a good matrix format for GPU SpMV computation [11, 57], to the best of our knowledge, this is the first study to investigate the design space of SpMV accelerators across different algorithms in addition to data representations. Based on the analysis, we propose a performance prediction model and a triple-mode accelerator architecture. The main contributions of the paper are as follows:
— This study quantifies the performance implications of algorithm and data representation selection for SpMV accelerators.
— The study finds two complementary design options and proposes Cerberus, a triple-mode accelerator that can process SpMV with the best data representation for a wide range of matrix and vector distributions. In addition, the architecture supports dense-mode computation efficiently.
— The study shows that a parameterized model can be used to select the best mode. The dimensions and densities of matrices, together with hardware parameters, determine the best mode.
4 Triple-mode Acceleration
4.1 Overview
As shown in the prior sections, switching among the three modes, \(C_mD_a\), \(B_mD_a\), and \(D_mD_a\), according to the workload characteristics can offer higher performance than any fixed best-performing mode. In this section, we propose Cerberus, a triple-mode accelerator architecture for SpMV, along with a mode selection method. The triple-mode accelerator can switch its execution mode between \(C_mD_a\) and \(B_mD_a\) for sparse computations and also supports dense computation. The software selection method determines which mode produces the best performance, based purely on input characteristics such as the matrix dimensions and density. Its high accuracy shows that the best mode is highly correlated with the static input characteristics of each SpMV workload.
Figure 7 illustrates the overall architecture of Cerberus. To maximize the advantages of parallel execution on multiple PEs, we partition the matrix in a row-wise manner to fit the 1D array of PEs, as shown in Figure 8. Each PE is equipped with parallel multipliers and adders. We choose a 1D PE organization since the 2D alternative requires inter-PE communication due to the row-wise merge operation, which incurs extra performance overhead by extending the critical path without the benefit of data reuse [17]. Unlike dense MM, which can exploit the data reuse of a 2D PE array, SpMV with unstructured and thus irregular sparsity utilizes PEs more efficiently in the 1D organization.
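To make the row-wise partitioning concrete, the following is a minimal sketch of distributing the rows of a CSR matrix across a 1D array of PEs. The round-robin assignment policy is our assumption for illustration; the paper does not specify the interleaving granularity.

```python
# Hypothetical sketch: row-wise partitioning of a sparse matrix
# (in CSR form) across a 1D array of processing elements.
# The round-robin row assignment is an assumption for illustration.

def partition_rows(row_ptr, num_pes):
    """Assign each matrix row to a PE in round-robin order.

    row_ptr: CSR row-pointer array of length m + 1.
    Returns a list of row-index lists, one per PE.
    """
    num_rows = len(row_ptr) - 1
    partitions = [[] for _ in range(num_pes)]
    for row in range(num_rows):
        partitions[row % num_pes].append(row)
    return partitions

# Example: 6 rows spread over 4 PEs.
row_ptr = [0, 2, 3, 7, 9, 9, 12]
print(partition_rows(row_ptr, 4))
# [[0, 4], [1, 5], [2], [3]]
```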
Mode Selection: Mode selection is done by the runtime driver software controlling the accelerators. The runtime system uses the dimensions and density of the input matrix to determine the mode. The selection result depends only on the properties of the matrices. For computation with fixed input matrices, such as weight matrices for ML inference, the mode is pre-selected for each matrix, and the matrix is stored in one of the compressed sparse format (\(C_mD_a\)), the bitmap format (\(B_mD_a\)), or the dense format (\(D_mD_a\)).
For certain applications, an input matrix can be dynamically updated over iterations of MV computation. For such applications, the input matrix may be converted between the compressed and bitmap formats. However, such a mode change is very infrequent, since the format selection depends on the density and dimensions of the matrix: the dimensions do not change for the same matrix, and the density changes only gradually. In addition, the format conversion requires updating only the indices or bitmaps, without changing the compressed value array. As both formats are row-major, both the read pattern from the old format and the write pattern to the new one are sequential.
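To illustrate the conversion cost, below is a minimal sketch of rebuilding per-row bitmaps from CSR column indices; the flat integer-encoded bitmap layout is an assumption for illustration, not the accelerator's actual metadata layout. Note that only the index metadata is rewritten, while the nonzero value array is shared by both formats.

```python
# Hypothetical sketch: converting CSR index metadata to a row-major
# bitmap. Only the index arrays are rewritten; the nonzero value
# array is shared by both formats and stays untouched.

def csr_indices_to_bitmap(row_ptr, col_idx):
    """Build one integer-encoded bitmap per row from CSR indices.

    Both the CSR reads and the bitmap writes proceed row by row,
    so the access pattern is sequential in both formats.
    """
    bitmaps = []
    for row in range(len(row_ptr) - 1):
        bits = 0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            bits |= 1 << col_idx[k]  # set the bit for each nonzero column
        bitmaps.append(bits)
    return bitmaps

# Example: 2 rows, 4 columns, nonzeros at (0,1), (0,3), (1,0).
print(csr_indices_to_bitmap([0, 2, 3], [1, 3, 0]))
# [10, 1]  i.e., binary 1010 and 0001
```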
The mode change in hardware causes negligible overhead because it merely changes which handler module is used. In addition, the mode change occurs only between two different instances of matrix-vector multiplication, such as different layers in DNNs. Thus, it does not require any updates to the internal states of the PEs. Note that the overhead of mode selection itself is also low, as it only requires an offline evaluation of a polynomial (i.e., the performance model of Section 4.3).
4.2 Architecture
Figure 9 shows the detailed architecture of the PE. The intra-PE architecture can be divided into four parts: the on-chip scratchpad memory, the index calculator, the value registers, and the dot product computation unit.
On-chip scratchpad memory (SPM): Each PE contains a small on-chip SPM to store the input and output data. We choose a local SPM instead of an inter-PE shared buffer to minimize interference between PEs. When the requested data is not in the SPM, the PE sends a request to the off-chip memory.
Index Calculator: Since switching execution modes without significant overhead is the most important requirement for a triple-mode architecture, we minimize the architectural changes on a mode change by confining them to the index calculator. The major difference between the modes is the way the column indices of non-zero values are calculated: how to detect the next value to be pushed to the multipliers. As described earlier in Figure 7, the index calculator contains two handler modules for the \(C_mD_a\) and \(B_mD_a\) modes. In \(C_mD_a\) mode, the index calculator first fetches the row pointers and detects the positions (column indices) of the non-zero elements in the row by reading the column index array sequentially. This requires two 64-bit registers: one for row pointers, and the other for column indices. In \(B_mD_a\) mode, the bitmap handler uses leading nonzero detection (LNZD) logic, which sequentially scans the bitmap of a matrix row and returns the position of the next nonzero bit as the column index of the next nonzero value. We use the LNZD design from prior work [33], which detects nonzero values in a 32-bit window with a delay of 1 cycle.
Also, for efficient nonzero detection, the bitmap is kept in a 512-bit register. In \(D_mD_a\) mode, the accelerator can run efficiently by deactivating all components of the index calculator and feeding values sequentially. On a mode change, one of the handler modules is selected as the index calculator and the other is deactivated.
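The following behavioral sketch contrasts the index streams of the two sparse handlers. It models only the functional behavior, not the cycle-level timing or register layout, and the per-step 32-bit LNZD window is an assumption based on the description above.

```python
# Hypothetical behavioral model of the two index-calculator handlers.
# Both yield the same stream of (row, column) positions of nonzeros;
# they differ only in how the next column index is found.

def csr_handler(row_ptr, col_idx):
    """CmDa: read the column-index array sequentially per row."""
    for row in range(len(row_ptr) - 1):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            yield row, col_idx[k]

def bitmap_handler(bitmaps, num_cols, window=32):
    """BmDa: scan each row bitmap with LNZD over 32-bit windows."""
    for row, bits in enumerate(bitmaps):
        for base in range(0, num_cols, window):
            chunk = (bits >> base) & ((1 << window) - 1)
            while chunk:
                lead = (chunk & -chunk).bit_length() - 1  # lowest set bit
                yield row, base + lead
                chunk &= chunk - 1  # clear that bit and continue scanning

# Both handlers produce the same index stream for the same matrix.
assert list(csr_handler([0, 2, 3], [1, 3, 0])) == \
       list(bitmap_handler([0b1010, 0b0001], 4))
```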
Value Registers (Fetchers): Using the position information (column indices) of non-zero values produced by the index calculator, the matrix and vector values at the corresponding positions are fetched and multiplied for the MV operation. For this purpose, two 64-bit registers, the matrix value register and the vector value register, fetch values from the SPM and feed them to the dot product computation unit.
Dot Product Computation Unit: To calculate the output values, the accelerator executes dot product operations. The dot product computation unit is a MAC unit with one multiplier and one adder (accumulator). This single-MAC-per-PE design is sufficient for two reasons: first, the proposed architecture can exploit parallelism through the PE array; second, as the major performance bottleneck of SpMV is index calculation, additional MACs would not be utilized efficiently.
4.3 Software Selector
Analytical performance model: Our performance model takes two sets of variables as input: (1) sparse matrix hyperparameters (matrix dimensions and density), and (2) hardware specifications (the number of PEs and the SPM size). We use a two-step selection algorithm, sketched below. The first step decides whether the dense mode (\(D_mD_a\)) should be considered. If the matrix density is higher than the threshold of 87.5% (i.e., the density at which the dense matrix format reaches a smaller memory footprint than the sparse formats), the selector adds the dense mode to the set of possible options, and thus chooses the best mode out of three (Dense, Bitmap, and Compressed; \(D_mD_a\), \(B_mD_a\), and \(C_mD_a\)). If the matrix density is lower than 87.5%, \(D_mD_a\) becomes very inefficient compared to the two sparse modes and is excluded from the selection, leaving only the two sparse modes (\(B_mD_a\) and \(C_mD_a\)). We adopt this first step of deciding whether \(D_mD_a\) should be considered to avoid overly complicating the performance model used in the second step. The performance model estimates the relative delay of each mode instead of predicting its exact execution cycle count.
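The two-step selection can be summarized in a short sketch; here the `score()` function stands in for the performance model of Equation (1), defined below, and its exact signature is our assumption for illustration.

```python
# Hypothetical sketch of the two-step mode selector. DENSE_THRESHOLD
# follows the 87.5% density threshold described in the text; score()
# is a stand-in for the Equation (1) performance model (lower = better).

DENSE_THRESHOLD = 0.875

def select_mode(rows, cols, density, num_pes, spm_size, score):
    """Return the predicted-best mode for the given matrix and hardware."""
    # Step 1: include the dense mode only above the density threshold.
    candidates = ["CmDa", "BmDa"]
    if density >= DENSE_THRESHOLD:
        candidates.append("DmDa")
    # Step 2: evaluate the performance model and pick the lowest score.
    return min(candidates,
               key=lambda m: score(m, rows, cols, density, num_pes, spm_size))
```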
After the set of possible modes is determined in the first step, the second step evaluates the performance scores of the candidates with a performance model. The performance model is essentially a function that produces a single score calculated as a combination of three estimated metrics: (1) the total time for on-chip memory accesses (total_onchip_access_time()), (2) the total time for off-chip memory accesses (total_offchip_access_time()), and (3) the total time of non-common additional operations (total_additional_ops_time()), as illustrated in Equation (1):

\[
\mathit{Score} = \mathit{total\_onchip\_access\_time}() + \mathit{total\_offchip\_access\_time}() + \mathit{total\_additional\_ops\_time}() \qquad (1)
\]
In Equation (1), E represents the number of PEs and P the number of SPM ports per PE; both parameterize the estimated event counts in the three terms. These latencies are calculated by multiplying the estimated number of occurrences of each architectural event by the statically collected cycle count needed to perform the event on the hardware. The details for calculating the estimated number of each architectural event are available in Table 3.
As the zero-skipping overhead for vectors is negligible for SpMV, vector density information is unnecessary, which makes offline mode selection feasible. Once the model calculates the score using the algorithmic and hardware parameters, it can determine which mode would perform better for the given workload. For \(C_mD_a\) mode, total_additional_ops_time() is always 0, while it is non-zero for \(B_mD_a\) (and \(D_mD_a\)). \(B_mD_a\) mode requires leading non-zero detection (LNZD) for the matrix bitmaps, and its overhead is equal to the number of matrix non-zero values.
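A minimal sketch of how such a score could be assembled is shown below. The event counts and per-event cycle costs are placeholders (the actual estimates come from Table 3 and static measurements), so only the overall structure mirrors Equation (1).

```python
# Hypothetical sketch of the Equation (1) score for one mode. The
# event-count estimators and per-event cycle costs are placeholders;
# the real values are derived from Table 3 and static measurements.

def mode_score(onchip_accesses, offchip_accesses, extra_ops,
               onchip_cycles, offchip_cycles, extra_op_cycles):
    """Combine the three estimated latency terms into a single score.

    Each term = (estimated number of events) x (cycles per event).
    For CmDa, extra_ops is 0; for BmDa, extra_ops equals the number
    of matrix non-zeros (one LNZD step per non-zero).
    """
    total_onchip_access_time = onchip_accesses * onchip_cycles
    total_offchip_access_time = offchip_accesses * offchip_cycles
    total_additional_ops_time = extra_ops * extra_op_cycles
    return (total_onchip_access_time
            + total_offchip_access_time
            + total_additional_ops_time)
```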
Effectiveness of the performance model: We evaluate the reliability of the performance model with the microbenchmarks of Table 6. Figure 10 reports the correlation between the score calculated by the performance model and the performance in cycle counts collected through simulation experiments. Across all 600 microbenchmarks, we observe a clear trend line, which corroborates the effectiveness of the model. The \(R^2\) coefficient is 0.77, which is consistent with this observation. Although the relationship between the model-estimated score and the measured performance in Figure 10 is not perfectly linear, the performance model is sufficient because it is used only for decisions made by comparing the performance scores of the modes (i.e., selecting the smallest value among the two or three candidates). The accuracy of the model-based software selector is 79.8%, and mispredictions occur when the performance gap between the two modes is small; as a result, with the model-based software selector, the multi-mode accelerator reaches 92.2% of the oracle performance.
6 Related Work
SpMV accelerators: Many accelerator architectures have been proposed for SpMV. EIE [17] is one of the initial studies in accelerating SpMV in neural networks. Tensaurus [59] regards SpMV as a subset of a larger problem domain: general sparse tensor-dense tensor multiplication. Two-Step [53] concentrates on utilizing memory bandwidth with a divide-and-conquer strategy for SpMV. SpaceA [63] overcomes the limitation of memory bandwidth with a processing-in-memory architecture for SpMV. MASR [16] focuses on accelerating RNN inference, using a bitmap to encode the inputs of SpMV. SIGMA [51] accelerates non-square SpGEMM for sparse neural network training, treating SpMV as a subset of its problem domain.
Software frameworks for SpMV acceleration: For CPU or GPU computing, the compressed (CSR) matrix format with the dot product algorithm is the common combination for SpMV. Intel MKL [30] is a software library for CPUs, while cuSPARSE [46] and bhSPARSE [38] serve as software frameworks for GPUs. These libraries support various operations on sparse matrices, including SpMV. Furthermore, Sedaghati et al. [57] and Dinkelbach et al. [11] introduce automated sparse matrix representation selectors for SpMV with ML-based decision models applicable to GPU software.
Other sparse accelerators: Researchers have also focused on hardware acceleration for tensor algebra with sparse matrices. OuterSPACE [48], Sparse-TPU [21], SpArch [66], MatRaptor [58], and InnerSP [3] focus on the acceleration of SpGEMM for HPC applications. ExTensor [22] is a general sparse tensor algebra accelerator, and Qin et al. [50] accelerate sparse tensor-dense tensor multiplication with support for multiple sparse formats. Flexagon [43] proposes a flexible SpGEMM accelerator that supports multiple dataflows for sparse neural networks. SpAtten [60] and Sanger [39] handle sparse attention layers and transformers based on sparse matrix multiplication acceleration. SparTen [15], SCNN [49], UCNN [23], GoSPA [9], and Cambricon-X [65] all accelerate sparse CNN inference. Procrustes [64] deals with the training of sparse CNNs. Sparseloop [62] provides a framework for selecting the best hardware architecture for sparse CNN inference.