This repo evaluates different matrix multiplication implementations on two large square matrices (2000-by-2000 by default in the examples below):
| Implementation | Long description |
|---|---|
| Naive | Most obvious implementation |
| Transposed | Transposing the second matrix for cache efficiency |
| sdot w/o hints | Replacing the inner loop with BLAS sdot() |
| sdot with hints | sdot() with a partially unrolled inner loop |
| SSE sdot | Vectorized sdot() with explicit SSE instructions |
| SSE+tiling sdot | SSE sdot() with loop tiling |
| OpenBLAS sdot | sdot() provided by OpenBLAS |
| OpenBLAS sgemm | sgemm() provided by OpenBLAS |
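
For orientation, the naive, transposed and SSE variants correspond roughly to the sketch below. This is an illustration written for this README rather than the repository's actual code; it assumes row-major `float` matrices held as arrays of row pointers and, for the SSE kernel, that `n` is a multiple of 4.

```c
#include <stdlib.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* "Naive": the inner loop strides down a column of b, which is
 * cache-unfriendly for large n. */
void mat_mul_naive(int n, float **a, float **b, float **c)
{
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < n; ++j) {
			float s = 0.0f;
			for (int k = 0; k < n; ++k) s += a[i][k] * b[k][j];
			c[i][j] = s;
		}
}

/* A dot product vectorized four floats at a time with explicit SSE
 * intrinsics; the kind of kernel behind the "SSE sdot" rows (n % 4 == 0
 * assumed here). */
float sdot_sse(int n, const float *x, const float *y)
{
	__m128 acc = _mm_setzero_ps();
	float t[4];
	for (int i = 0; i < n; i += 4)
		acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(x + i), _mm_loadu_ps(y + i)));
	_mm_storeu_ps(t, acc);
	return t[0] + t[1] + t[2] + t[3];
}

/* "Transposed": copy b^T once so every inner product reads both operands
 * sequentially; the inner loop can then be a plain loop, BLAS sdot() or
 * the SSE kernel above. */
void mat_mul_transposed(int n, float **a, float **b, float **c)
{
	float **bT = (float**)malloc(n * sizeof(float*));
	for (int i = 0; i < n; ++i) {
		bT[i] = (float*)malloc(n * sizeof(float));
		for (int j = 0; j < n; ++j) bT[i][j] = b[j][i];
	}
	for (int i = 0; i < n; ++i)
		for (int j = 0; j < n; ++j) {
			float s = 0.0f;
			for (int k = 0; k < n; ++k) s += a[i][k] * bT[j][k];
			c[i][j] = s;  /* or: c[i][j] = sdot_sse(n, a[i], bT[j]); */
		}
	for (int i = 0; i < n; ++i) free(bT[i]);
	free(bT);
}
```

The remaining variants in the table refine this idea further, for example by tiling the loops so that blocks of `a` and the transposed copy stay in cache (the "SSE+tiling sdot" row).
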
To compile the evaluation program, run `make CBLAS=/path/to/cblas/prefix`, or omit the `CBLAS` setting if you don't have CBLAS installed. After compilation, run `./matmul -h` to see the available options. Here are the results on my machines:
| Implementation | `-a` option | Linux, `-n2000` | Linux, `-n4000` | Linux/icc, `-n4000` | Mac, `-n2000` |
|---|---|---|---|---|---|
| Naive | 0 | 7.53 sec | 188.85 sec | 173.76 sec | 77.45 sec |
| Transposed | 1 | 6.66 sec | 55.48 sec | 21.04 sec | 9.73 sec |
| sdot w/o hints | 4 | 6.66 sec | 55.04 sec | 21.35 sec | 9.70 sec |
| sdot with hints | 3 | 2.41 sec | 29.47 sec | 21.69 sec | 2.92 sec |
| SSE sdot | 2 | 1.36 sec | 21.79 sec | 22.18 sec | 2.92 sec |
| SSE+tiling sdot | 7 | 1.11 sec | 10.84 sec | 10.97 sec | 1.90 sec |
| OpenBLAS sdot | 5 | 2.69 sec | 28.87 sec | 5.61 sec | |
| OpenBLAS sgemm | 6 | 0.63 sec | 4.91 sec | 0.86 sec | |
| uBLAS | | 7.43 sec | 165.74 sec | | |
| Eigen | | 0.61 sec | 4.76 sec | 5.01 sec | 0.85 sec |
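
For reference, the two OpenBLAS rows above measure the library's `cblas_sdot()` and `cblas_sgemm()` routines; the calls look roughly like the sketch below (assuming contiguous row-major single-precision matrices; the benchmark's own wrappers may differ).

```c
#include <cblas.h>

/* One entry of C as a level-1 BLAS dot product; bT_row must be a row of the
 * transposed B so that both vectors are contiguous (stride 1). */
float entry_via_sdot(int n, const float *a_row, const float *bT_row)
{
	return cblas_sdot(n, a_row, 1, bT_row, 1);
}

/* The whole product C = 1.0*A*B + 0.0*C as a single level-3 BLAS call. */
void matmul_via_sgemm(int n, const float *a, const float *b, float *c)
{
	cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
	            n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
}
```

`sgemm()` can block over all three loops at once, which is largely why it outpaces the `sdot()`-based variants in the table.
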
The machine configurations are as follows:
| Machine | CPU | OS | Compiler |
|---|---|---|---|
| Linux | 2.6 GHz Xeon E5-2697 | CentOS 6 | gcc-4.4.7/icc-15.0.3 |
| Mac | 1.7 GHz Intel Core i5-2557M | OS X 10.9.5 | clang-600.0.57/LLVM-3.5svn |
On both machines, OpenBLAS-0.2.18 is compiled with the following options (no AVX or multithreading):
```
TARGET=CORE2
BINARY=64
USE_THREAD=0
NO_SHARED=1
ONLY_CBLAS=1
NO_LAPACK=1
NO_LAPACKE=1
```