# GEMMul8 (GEMMulate): GEMM emulation using int8 matrix engines based on the Ozaki Scheme II
GEMMul8 is a library for emulating high-precision matrix multiplication (SGEMM/DGEMM/CGEMM/ZGEMM) using low-precision INT8 matrix engines. It is based on the Ozaki Scheme II, enabling bit-wise reproducible results with superior performance and power efficiency compared to native floating-point implementations.
- Technical Overview
- Build
- Running Sample Codes
- Usage
- Numerical results
- Acknowledgment
- Contact (Responsible Developer)
- References
- Citations
- License
## Technical Overview

GEMMul8 implements GEMM emulation based on the Ozaki Scheme II, which utilizes the Chinese Remainder Theorem (CRT).
A larger number of moduli (num_moduli) for the CRT results in higher precision at the cost of increased computation time.
This project supports both CUDA and HIP backends for GPU computation.
The GEMM emulation ensures bit-wise numerical reproducibility. This is achieved through a combination of exact, error-free computations and a fixed order of operations.
However, please note that this reproducibility guarantee may not extend to environments using different toolkits or compilers. Discrepancies can arise from varying propagation rules for Inf/NaN values or differences in the precision of underlying math library functions.
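The CRT idea can be illustrated with a toy scalar example (illustrative C++ only, not the library's internal code; GEMMul8 actually splits floating-point matrices and runs the modular products on INT8 matrix engines): an exact integer dot product is recovered from its residues modulo a few small, pairwise-coprime moduli.

```cpp
#include <cstdint>
#include <cstdio>
#include <utility>
#include <vector>

// Modular inverse via the extended Euclidean algorithm
// (valid here because the moduli are pairwise coprime).
static int64_t inv_mod(int64_t a, int64_t m) {
    int64_t t = 0, newt = 1, r = m, newr = a % m;
    while (newr != 0) {
        const int64_t q = r / newr;
        t -= q * newt; std::swap(t, newt);
        r -= q * newr; std::swap(r, newr);
    }
    return ((t % m) + m) % m;
}

int main() {
    const std::vector<int64_t> p = {251, 241, 239}; // pairwise-coprime moduli
    const std::vector<int64_t> a = {93, -71, 120}, b = {55, 102, -13};

    int64_t exact = 0; // reference dot product
    for (size_t i = 0; i < a.size(); ++i) exact += a[i] * b[i];

    // Residues of the dot product modulo each p[j] -- the low-precision part
    // that GEMMul8 offloads to the INT8 matrix engines.
    std::vector<int64_t> r(p.size(), 0);
    for (size_t j = 0; j < p.size(); ++j) {
        for (size_t i = 0; i < a.size(); ++i) r[j] = (r[j] + a[i] * b[i]) % p[j];
        r[j] = (r[j] + p[j]) % p[j];
    }

    // CRT reconstruction: x = sum_j r_j*(M/p_j)*((M/p_j)^{-1} mod p_j) (mod M),
    // then shift into the symmetric range [-M/2, M/2) to recover the sign.
    int64_t M = 1;
    for (const int64_t pj : p) M *= pj;
    int64_t x = 0;
    for (size_t j = 0; j < p.size(); ++j) {
        const int64_t Mj = M / p[j];
        int64_t term = (r[j] * Mj) % M;
        term = (term * inv_mod(Mj % p[j], p[j])) % M;
        x = (x + term) % M;
    }
    if (x >= M / 2) x -= M;

    std::printf("exact = %lld, CRT-reconstructed = %lld\n",
                (long long)exact, (long long)x);
    return 0;
}
```

Using more moduli enlarges M and hence the range of products that can be represented exactly, which is why increasing `num_moduli` raises the attainable precision.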
## Build

Run `make` in the `GEMMul8/GEMMul8` directory to compile all source files.

```bash
make -j8
```

To rebuild from scratch, run `make clean && make -j8`.
| Option | Default | Description |
|---|---|---|
| `CUDA_PATH` | `/usr/local/cuda` | Path to your CUDA toolkit installation. Used for CUDA backends. |
| `HIP_PATH` | `/opt/rocm` | Path to your HIP (ROCm) toolkit installation. Used for HIP backends. |
| `BACKEND` | `auto` | Select GPU backend: `cuda`, `hip`, or `auto` (auto-detect). |
| `GPU_ARCH` | `auto` | Target GPU architecture. Examples: `80` (A100), `90` (H100), `gfx90a` (MI250X), `gfx942` (MI300X). |
| `ozIMMU_EF` | `no` | Set to `yes` to use ozIMMU in the sample code. |
| `ozIMMU_EF_DIR` | `../../ozIMMU_EF` | Path to the ozIMMU implementation. Used if `ozIMMU_EF=yes`. |
| `cuMpSGEMM` | `no` | Set to `yes` to use cuMpSGEMM in the sample code. |
| `cuMpSGEMM_DIR` | `../../cuMpSGEMM` | Path to cuMpSGEMM. Used if `cuMpSGEMM=yes`. |
| `HIJACK` | `no` | Set to `yes` to hijack cuBLAS GEMM calls with emulation in the sample code. |
> [!NOTE]
> - If you enable optional modules (`ozIMMU_EF=yes`, `cuMpSGEMM=yes`), please clone and build their repositories first.
>   - cuMpSGEMM - CUDA Mutable-precision SGEMM
>   - ozIMMU - DGEMM on Int8 Tensor Core
>   - See also Accelerator for ozIMMU
> - If you use these, please ensure compliance with their respective license terms.
- `BACKEND=auto` will attempt to detect your GPU vendor automatically.
- `GPU_ARCH=auto` will automatically detect and use the appropriate compute capability or architecture for your GPU.
- The target GPU architecture can be found from, e.g., the CUDA GPU Compute Capability list or the AMD GPU hardware specifications.
Build for an NVIDIA H100/H200 GPU (Compute Capability 9.0):

```bash
make -j8 BACKEND=cuda CUDA_PATH=/usr/local/cuda GPU_ARCH=90
```

Build for an AMD MI300X GPU (gfx942 architecture):

```bash
make -j8 BACKEND=hip HIP_PATH=/opt/rocm GPU_ARCH=gfx942
```

## Running Sample Codes

After building the project, you can run sample codes from the testing directory.
Navigate to the testing directory and use make to run tests for different precisions.
| Mode | Value | Description |
|---|---|---|
| `test_s` | (none) | Tests for SGEMM |
| `test_d` | (none) | Tests for DGEMM |
| `test_c` | (none) | Tests for CGEMM |
| `test_z` | (none) | Tests for ZGEMM |
| `MODE` | `accuracy` | Tests numerical accuracy (maximum element-wise relative error). |
| | `flops` | Measures TFLOPS with square matrices. |
| | `watt` | Measures watt and GFLOPS/watt with square matrices. |
| | `flops_rect` | Measures TFLOPS with rectangular matrices. |
| | `watt_rect` | Measures watt and GFLOPS/watt with rectangular matrices. |
| | `all` | Runs all available test modes. |
```bash
# Run accuracy test (DGEMM)
make test_d MODE="accuracy"
```

```bash
# Run all tests (SGEMM)
make test_s MODE="all"
```

```bash
# Run accuracy, flops, & watt tests (SGEMM & DGEMM)
make test_s test_d MODE="accuracy flops watt"
```

## Usage

This library provides two ways to use the GEMM emulation.
Call GEMMul8 functions explicitly from your source code. This gives you fine-grained control over the emulation parameters.
```cpp
#include <vector>
#include <cublas_v2.h>
#include <cuda_runtime.h>

#include "gemmul8.hpp"

// (m, n, k, alpha, beta, and the device arrays devA, devB, devC with leading
// dimensions lda, ldb, ldc are assumed to be defined elsewhere.)

// 1. Create a handle to the cuBLAS library context
cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);

// 2. Settings
const unsigned num_moduli = 14u; // Accuracy knob: 2 <= num_moduli <= 20
const bool fastmode = true;      // true (fast mode) or false (accurate mode)

// 3. Allocate workspace
const size_t worksize = gemmul8::workSize(m, n, k, num_moduli); // required memory (bytes)
void *work;
cudaMalloc(&work, worksize);

// 4. (Optional) Create a vector to store the timing breakdown
std::vector<double> time_breakdown(4, 0.0);

// 5. Run emulation
// The function returns a vector with execution times for each phase.
time_breakdown = gemmul8::gemm(cublas_handle,
                               CUBLAS_OP_N, CUBLAS_OP_N,
                               m, n, k,
                               &alpha, devA, lda,
                               devB, ldb,
                               &beta, devC, ldc,
                               num_moduli, fastmode, work);

// 6. Free workspace
cudaFree(work);

// 7. Destroy the handle
cublasDestroy(cublas_handle);
```

The arguments for `gemmul8::gemm` closely match the standard `cublas<t>gemm` and `hipblasXgemm` interfaces, with additional parameters that control internal preprocessing and workspace reuse.
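For example, the timing vector returned by `gemmul8::gemm` can be summarized with a small helper (a minimal sketch; the generic "phase" labels are placeholders, not names defined by GEMMul8):

```cpp
#include <cstdio>
#include <vector>

// Summarize the per-phase timing vector returned by gemmul8::gemm.
// The library returns one entry per internal phase; the generic
// "phase i" labels below are placeholders, not names it defines.
void print_breakdown(const std::vector<double> &t) {
    double total = 0.0;
    for (size_t i = 0; i < t.size(); ++i) {
        std::printf("phase %zu: %.3e s\n", i, t[i]);
        total += t[i];
    }
    std::printf("total   : %.3e s\n", total);
}
```

For instance, call `print_breakdown(time_breakdown);` after the emulation step above.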
GEMMul8 provides both a workspace query function (workSize) and an extended GEMM computation interface (gemm), as shown below:
```cpp
namespace gemmul8 {

// workSize returns the required workspace size in bytes.
template <bool is_Complex = false>        // [option] If true, return workspace size for CGEMM/ZGEMM
size_t workSize(
    const size_t m,                       // Number of rows of C
    const size_t n,                       // Number of columns of C
    const size_t k,                       // Inner dimension <= 2^17
    const unsigned num_moduli,            // #moduli, 2 <= num_moduli <= 20
    const bool enable_skip_scalA = false, // [option] Reserve extra space for A to allow skip_scalA
    const bool enable_skip_scalB = false, // [option] Reserve extra space for B to allow skip_scalB
    size_t *workSizeA = nullptr,          // [option] Output: workspace size used for A8i and sftA
    size_t *workSizeB = nullptr           // [option] Output: workspace size used for B8i and sftB
);

// gemm returns the computation time in seconds of each computational phase.
#if defined(__NVCC__)
template <typename T> std::vector<double> gemm(
    cublasHandle_t handle,                // Handle to the cuBLAS library context
    const cublasOperation_t op_A,         // CUBLAS_OP_N or CUBLAS_OP_T
    const cublasOperation_t op_B,         // CUBLAS_OP_N or CUBLAS_OP_T
    const size_t m,                       // Number of rows of C
    const size_t n,                       // Number of columns of C
    const size_t k,                       // Inner dimension <= 2^17
    const T *alpha,                       // Scaling factor for op(A)*op(B)
    const T *const A,                     // 1-D device array of dimensions lda*k (CUBLAS_OP_N) or lda*m (CUBLAS_OP_T)
    const size_t lda,                     // Leading dimension of A
    const T *const B,                     // 1-D device array of dimensions ldb*n (CUBLAS_OP_N) or ldb*k (CUBLAS_OP_T)
    const size_t ldb,                     // Leading dimension of B
    const T *beta,                        // Scaling factor for C
    T *const C,                           // 1-D device array of dimensions ldc*n
    const size_t ldc,                     // Leading dimension of C
    const unsigned num_moduli,            // #moduli, 2 <= num_moduli <= 20
    const bool fastmode,                  // false (accurate mode) or true (fast mode)
    void *const work,                     // Preallocated workspace
    void *const workA = nullptr,          // [optional] Separate workspace for A (if nullptr, uses work)
    void *const workB = nullptr,          // [optional] Separate workspace for B (if nullptr, uses work)
    const bool enable_skip_scalA = false, // [optional] Enables scaling-skip mechanism for A
    const bool enable_skip_scalB = false, // [optional] Enables scaling-skip mechanism for B
    bool skip_scalA = false,              // [optional] If true, skip preprocessing for A
    bool skip_scalB = false               // [optional] If true, skip preprocessing for B
);
#endif

#if defined(__HIPCC__)
template <typename T> std::vector<double> gemm(
    hipblasHandle_t handle,               // Handle to the hipBLAS library context
    const hipblasOperation_t op_A,        // HIPBLAS_OP_N or HIPBLAS_OP_T
    const hipblasOperation_t op_B,        // HIPBLAS_OP_N or HIPBLAS_OP_T
    const size_t m,                       // Number of rows of C
    const size_t n,                       // Number of columns of C
    const size_t k,                       // Inner dimension <= 2^17
    const T *alpha,                       // Scaling factor for op(A)*op(B)
    const T *const A,                     // 1-D device array of dimensions lda*k (HIPBLAS_OP_N) or lda*m (HIPBLAS_OP_T)
    const size_t lda,                     // Leading dimension of A
    const T *const B,                     // 1-D device array of dimensions ldb*n (HIPBLAS_OP_N) or ldb*k (HIPBLAS_OP_T)
    const size_t ldb,                     // Leading dimension of B
    const T *beta,                        // Scaling factor for C
    T *const C,                           // 1-D device array of dimensions ldc*n
    const size_t ldc,                     // Leading dimension of C
    const unsigned num_moduli,            // #moduli, 2 <= num_moduli <= 20
    const bool fastmode,                  // false (accurate mode) or true (fast mode)
    void *const work,                     // Preallocated workspace
    void *const workA = nullptr,          // [optional] Separate workspace for A (if nullptr, uses work)
    void *const workB = nullptr,          // [optional] Separate workspace for B (if nullptr, uses work)
    const bool enable_skip_scalA = false, // [optional] Enables scaling-skip mechanism for A
    const bool enable_skip_scalB = false, // [optional] Enables scaling-skip mechanism for B
    bool skip_scalA = false,              // [optional] If true, skip preprocessing for A
    bool skip_scalB = false               // [optional] If true, skip preprocessing for B
);
#endif

} // namespace gemmul8
```

- `gemmul8::gemm` internally converts the input matrices `A` and `B` into INT8 format and performs modular multiplications across multiple moduli.
- The conversion (scaling) step can be skipped in consecutive GEMM calls if the same matrices are reused, allowing for substantial performance gains.
- If `enable_skip_scal{A|B} = true`, additional workspace is preallocated so that the scaled INT8 representation of `A`/`B` can be retained between calls.
- If both `enable_skip_scal{A|B} && skip_scal{A|B} = true`, the scaling step for `A`/`B` is actually skipped, and previously prepared data are reused for faster execution. This mode assumes that the contents of `A`/`B` in device memory remain unchanged.
- This mechanism is particularly effective when repeatedly multiplying the same `A` (or `B`) with different `B` (or `A`) matrices.
> [!NOTE]
> When using `skip_scalA` / `skip_scalB`, the preprocessing step that converts `A`/`B` into its internal INT8 representation is skipped.
> For correctness, the following conditions must all hold between consecutive GEMM calls:
>
> - The dimensions (`M`/`K` for `A`, `K`/`N` for `B`) must be identical to those in the previous call.
> - The operation type (`CUBLAS_OP_N`/`CUBLAS_OP_T`) for `A`/`B` must be the same as before.
> - The value of `num_moduli` must remain unchanged.
> - The `fastmode` setting must be identical to that of the previous call.
> - The contents of `A`/`B` in device memory must not be modified between calls.
> [!CAUTION]
> If any of these conditions differ, the cached scaled data become invalid, and skipping must not be used.
> In such cases, set `skip_scalA=false` / `skip_scalB=false`.
> [!CAUTION]
> This skip mechanism is designed for repeated GEMM calls with identical `A` or `B`. Use it only when you are certain that the input matrices and configuration have not changed. When in doubt, disable skipping to ensure correctness.

The following example illustrates this mechanism:
```cpp
#include <algorithm>
#include <cstdint>
#include <cublas_v2.h>
#include <cuda_runtime.h>

#include "gemmul8.hpp"

// (alpha, beta, the device arrays devA1, devA2, devB, devC, and the leading
// dimensions lda, ldb, ldc are assumed to be defined elsewhere.)

// 1. Create a handle to the cuBLAS library context
cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);

// 2. Settings
const unsigned num_moduli = 14u;
const bool fastmode = false;

// 3. Matrix shapes (B is shared, so K and N must stay the same)
const size_t m1 = 64, n1 = 10, k1 = 10; // 1st GEMM: C (64×10) = A1 (64×10) × B (10×10)
const size_t m2 = 20, n2 = 10, k2 = 10; // 2nd GEMM: C (20×10) = A2 (20×10) × B (10×10)

// 4. Allocate workspace
size_t worksizeA, worksizeB;
const size_t worksize = gemmul8::workSize(
    std::max(m1, m2), std::max(n1, n2), std::max(k1, k2),
    num_moduli, false, true, &worksizeA, &worksizeB);
void *work;
cudaMalloc(&work, worksize);
int8_t *workA    = reinterpret_cast<int8_t *>(work); // fixed workspace for A
int8_t *workB    = workA + worksizeA;                // fixed workspace for B
int8_t *work_rem = workB + worksizeB;                // remaining workspace

// 5. Run GEMM (first call: scaling performed for both A and B)
gemmul8::gemm(cublas_handle,
              CUBLAS_OP_N, CUBLAS_OP_N,
              m1, n1, k1,
              &alpha, devA1, lda,
              devB, ldb,
              &beta, devC, ldc,
              num_moduli, fastmode, (void *)work_rem, (void *)workA, (void *)workB,
              false, true, false, false);

// 6. Reuse scaled B (second call: skip_scalB = true)
gemmul8::gemm(cublas_handle,
              CUBLAS_OP_N, CUBLAS_OP_N,
              m2, n2, k2,
              &alpha, devA2, lda,
              devB, ldb,
              &beta, devC, ldc,
              num_moduli, fastmode, (void *)work_rem, (void *)workA, (void *)workB,
              false, true, false, true);

// 7. Free workspace
cudaFree(work);

// 8. Destroy the handle
cublasDestroy(cublas_handle);
```

Intercept standard GEMM calls automatically without modifying the application source code.
- `cublasSgemm`, `cublasDgemm`, `cublasCgemm`, `cublasZgemm`
- `cublasSgemm_v2`, `cublasDgemm_v2`, `cublasCgemm_v2`, `cublasZgemm_v2`
- `cublasGemmEx`
- `hipblasSgemm`, `hipblasDgemm`, `hipblasCgemm`, `hipblasZgemm`
- `hipblasGemmEx`, `hipblasGemmEx_v2`
- Build the library.
- Set the `LD_PRELOAD` environment variable:

  ```bash
  export LD_PRELOAD=/path-to-GEMMul8/lib/libgemmul8.so
  ```

- Run your application.
The hijack behavior is controlled via the following environment variables, e.g.:

```bash
export GEMMUL8_NUM_MOD_D=15
export GEMMUL8_NUM_MOD_S=7
export GEMMUL8_NUM_MOD_Z=15
export GEMMUL8_NUM_MOD_C=7
export GEMMUL8_FASTMODE_D=1
export GEMMUL8_FASTMODE_S=0
export GEMMUL8_FASTMODE_Z=1
export GEMMUL8_FASTMODE_C=0
export GEMMUL8_MAX_M=4096
export GEMMUL8_MAX_N=4096
export GEMMUL8_MAX_K=4096
export GEMMUL8_MAX_NUM_MOD=18
export GEMMUL8_SKIP_SCALE_A=1
export GEMMUL8_SKIP_SCALE_B=1
```

| Variable | Default | Applies to | Description |
|---|---|---|---|
| `GEMMUL8_NUM_MOD_D` | `0` | DGEMM | Number of moduli (`unsigned num_moduli`) used in DGEMM emulation. Controls accuracy. |
| `GEMMUL8_NUM_MOD_S` | `0` | SGEMM | Number of moduli (`unsigned num_moduli`) used in SGEMM emulation. Controls accuracy. |
| `GEMMUL8_NUM_MOD_Z` | `0` | ZGEMM | Number of moduli (`unsigned num_moduli`) used in ZGEMM emulation. Controls accuracy. |
| `GEMMUL8_NUM_MOD_C` | `0` | CGEMM | Number of moduli (`unsigned num_moduli`) used in CGEMM emulation. Controls accuracy. |
| `GEMMUL8_FASTMODE_D` | `0` | DGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| `GEMMUL8_FASTMODE_S` | `0` | SGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| `GEMMUL8_FASTMODE_Z` | `0` | ZGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| `GEMMUL8_FASTMODE_C` | `0` | CGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| `GEMMUL8_MAX_M` | `0` | all | Maximum value of M used to preallocate workspace memory. |
| `GEMMUL8_MAX_N` | `0` | all | Maximum value of N used to preallocate workspace memory. |
| `GEMMUL8_MAX_K` | `0` | all | Maximum value of K used to preallocate workspace memory. |
| `GEMMUL8_MAX_NUM_MOD` | `2` | all | Maximum number of moduli used when computing the size of the preallocated workspace. |
| `GEMMUL8_SKIP_SCALE_A` | `0` | all | Enables skipping redundant preprocessing for A (1 = enable, 0 = disable). |
| `GEMMUL8_SKIP_SCALE_B` | `0` | all | Enables skipping redundant preprocessing for B (1 = enable, 0 = disable). |
- If `GEMMUL8_NUM_MOD_{S|D|Z|C} < 2 || 20 < GEMMUL8_NUM_MOD_{S|D|Z|C}`, the corresponding native GEMM routine is executed instead.
- This hook mode preallocates a single large GPU workspace on demand and resizes it as needed.
- Each `cublasHandle_t` maintains an independent workspace.
- The workspace is automatically released when the corresponding handle is destroyed via `cublasDestroy()` or `cublasDestroy_v2()`.
- The workspace size is determined as `max(wsmax, ws, pre_ws)` (see the sketch after this list), where
  - `wsmax = workSize(GEMMUL8_MAX_M, GEMMUL8_MAX_N, GEMMUL8_MAX_K, GEMMUL8_MAX_NUM_MOD)`
  - `ws = workSize(m, n, k, GEMMUL8_NUM_MOD_{D|S|Z|C})`
  - `pre_ws` = the size of the previously allocated workspace for the same `cublasHandle_t`
- If the `GEMMUL8_MAX_{M|N|K|NUM_MOD}` variables are set appropriately, the workspace size becomes fixed at `wsmax`, avoiding costly reallocations during subsequent GEMM calls.
- If `GEMMUL8_NUM_MOD_Z > 0` or `GEMMUL8_NUM_MOD_C > 0`, `wsmax` is the workspace size for the complex cases.
- The workspace is never shrunk automatically; it only grows when a larger allocation is required.
- When `GEMMUL8_SKIP_SCALE_{A|B}=1`, redundant preprocessing for `A`/`B` is skipped (see below).
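Below is a rough sketch of this sizing rule (illustrative only, not the library's internal code; it assumes nothing beyond the public `gemmul8::workSize` query shown earlier):

```cpp
#include <algorithm>
#include <cstddef>

#include "gemmul8.hpp"

// Illustrative sketch of the hook's workspace sizing rule: the per-handle
// workspace grows to max(wsmax, ws, pre_ws) and is never shrunk.
// max_m/max_n/max_k/max_num_mod correspond to GEMMUL8_MAX_{M|N|K|NUM_MOD},
// and pre_ws is the size of the previously allocated workspace.
size_t next_workspace_size(size_t m, size_t n, size_t k, unsigned num_moduli,
                           size_t max_m, size_t max_n, size_t max_k,
                           unsigned max_num_mod, size_t pre_ws) {
    const size_t wsmax = gemmul8::workSize(max_m, max_n, max_k, max_num_mod);
    const size_t ws    = gemmul8::workSize(m, n, k, num_moduli);
    return std::max({wsmax, ws, pre_ws}); // never smaller than what exists
}
```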
> [!IMPORTANT]
> - `GEMMUL8_SKIP_SCALE_{A|B}=1` allows consecutive GEMM calls using the same matrix pointers (`A`/`B`) to reuse already-scaled intermediate data.
> - Automatic skip of the conversion from `A` or `B` to INT8 is enabled when:
>   - `GEMMUL8_SKIP_SCALE_{A|B}=1`.
>   - The same computation mode (fast mode or accurate mode) is used for both the previous and current calls.
>   - The same `cublasHandle_t` is used for both the previous and current calls.
>   - The same device pointer (`A` or `B`) is used for both the previous and current calls.
>   - The same matrix shapes (`M`/`K`/`lda` for `A`, `K`/`N`/`ldb` for `B`) are used for both the previous and current calls.
>   - The same operation flag (`transa` for `A`, `transb` for `B`) is used for both the previous and current calls.
>   - The same number of moduli (`GEMMUL8_NUM_MOD_{D|S|Z|C}`) is used for both the previous and current calls.
>   - The same workspace size (`ws`) is used for both the previous and current calls.
> [!TIP]
> To ensure the last condition is met, it is recommended to set the `GEMMUL8_MAX_{M|N|K|NUM_MOD}` variables.
> This fixes the workspace size to `wsmax`, preventing it from changing between calls.
> [!CAUTION]
> `GEMMUL8_SKIP_SCALE_{A|B}=1` assumes that the contents of `A` or `B` remain unchanged in GPU memory.
> If their contents are modified between GEMM calls, set `GEMMUL8_SKIP_SCALE_{A|B}=0` to ensure correctness.
You can also set these environment variables programmatically from within your code using `setenv`.

```cpp
char num_moduli[12];

// Run emulation with num_moduli = 15 & fastmode = true
snprintf(num_moduli, sizeof(num_moduli), "%u", 15u);
setenv("GEMMUL8_NUM_MOD_D", num_moduli, 1);
setenv("GEMMUL8_FASTMODE_D", "1", 1);
cublasDgemm_v2(...);

// Run emulation with num_moduli = 18 & fastmode = false
snprintf(num_moduli, sizeof(num_moduli), "%u", 18u);
setenv("GEMMUL8_NUM_MOD_D", num_moduli, 1);
setenv("GEMMUL8_FASTMODE_D", "0", 1);
cublasDgemm_v2(...);
```

## Numerical results

See numerical results in the separate repository: GEMMul8_numerical_results
## Acknowledgment

> [!CAUTION]
> Please do not contact the individuals listed below regarding this code.

- Patrick Gutsche (École Normale Supérieure de Lyon, France)
- Prajval Kumar (Indian Institute of Science Education and Research, India)
- Dr. William Dawson (RIKEN Center for Computational Science, Japan)
- Dr. Toshiyuki Imamura (RIKEN Center for Computational Science, Japan)
- Dr. Qianxiang Ma (RIKEN Center for Computational Science, Japan)
- Prof. Rio Yokota (Institute of Science Tokyo, Japan)

(Affiliations as of 2025)
## Contact (Responsible Developer)

- Yuki Uchino (RIKEN Center for Computational Science, Japan)
- yuki.uchino.fe (at) riken.jp
## References

- Ootomo, H., & Yokota, R. (2022). Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance. The International Journal of High Performance Computing Applications, 36(4), 475-491. https://doi.org/10.1177/10943420221090256
- Ootomo, H., Manabe, H., Harada, K., & Yokota, R. (2023). Quantum Circuit Simulation by SGEMM Emulation on Tensor Cores and Automatic Precision Selection. In High Performance Computing (pp. 259-276). Springer. https://doi.org/10.1007/978-3-031-32041-5_14
- Ootomo, H., Ozaki, K., & Yokota, R. (2024). DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 38(4), 297-313. https://doi.org/10.1177/10943420241239588
- Uchino, Y., Ozaki, K., & Imamura, T. (2025). Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 39(3), 462-476. https://doi.org/10.1177/10943420241313064
## Citations

```bibtex
@inproceedings{10.1145/3731599.3767539,
  author = {Uchino, Yuki and Ozaki, Katsuhisa and Imamura, Toshiyuki},
  title = {High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines},
  year = {2025},
  isbn = {9798400718717},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3731599.3767539},
  doi = {10.1145/3731599.3767539},
  booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
  pages = {1824-1831},
  numpages = {8},
  series = {SC Workshops '25}
}
```

and

```bibtex
@misc{ozaki2025ozakischemeiigemmoriented,
  title={Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique},
  author={Katsuhisa Ozaki and Yuki Uchino and Toshiyuki Imamura},
  year={2025},
  eprint={2504.08009},
  archivePrefix={arXiv},
  primaryClass={cs.MS},
  url={https://arxiv.org/abs/2504.08009},
}
```

## License

This project is licensed under the MIT License. See the LICENSE file for details.