GEMMul8

GEMMul8 (GEMMulate): GEMM emulation using int8 matrix engines based on the Ozaki Scheme II

GEMMul8 is a library for emulating high-precision matrix multiplication (SGEMM/DGEMM/CGEMM/ZGEMM) using low-precision INT8 matrix engines. It is based on the Ozaki Scheme II, enabling bit-wise reproducible results with superior performance and power efficiency compared to native floating-point implementations.

Technical Overview

GEMMul8 implements GEMM emulation based on the Ozaki Scheme II, which utilizes the Chinese Remainder Theorem (CRT). A larger number of moduli (num_moduli) for the CRT results in higher precision at the cost of increased computation time.
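
As a schematic illustration (not the library's exact formulation): if the matrix entries are mapped to integers, the CRT recovers an exact integer value c from its residues c_i modulo pairwise coprime moduli m_1, ..., m_t, provided the product of the moduli exceeds the range of c:

    c \equiv c_i \pmod{m_i}, \quad i = 1,\dots,t,
    \qquad
    c = \Bigl(\sum_{i=1}^{t} c_i \, M_i \, \bigl(M_i^{-1} \bmod m_i\bigr)\Bigr) \bmod M,
    \qquad M = \prod_{i=1}^{t} m_i, \quad M_i = M / m_i .

Each additional modulus enlarges M, which is why increasing num_moduli widens the exactly representable range and thus raises the precision of the emulated product.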

This project supports both CUDA and HIP backends for GPU computation.

The GEMM emulation ensures bit-wise numerical reproducibility. This is achieved through a combination of exact, error-free computations and a fixed order of operations.

However, please note that this reproducibility guarantee may not extend to environments using different toolkits or compilers. Discrepancies can arise from varying propagation rules for Inf/NaN values or differences in the precision of underlying math library functions.

Build

Run make in the GEMMul8/GEMMul8 directory to compile all source files.

make -j8

To rebuild from scratch, run make clean && make -j8.

make options

| Option | Default | Description |
| --- | --- | --- |
| CUDA_PATH | /usr/local/cuda | Path to your CUDA toolkit installation. Used for CUDA backends. |
| HIP_PATH | /opt/rocm | Path to your HIP (ROCm) toolkit installation. Used for HIP backends. |
| BACKEND | auto | Select GPU backend: cuda, hip, or auto (auto-detect). |
| GPU_ARCH | auto | Target GPU architecture. Examples: 80 (A100), 90 (H100), gfx90a (MI250X), gfx942 (MI300X). |
| ozIMMU_EF | no | Set to yes to use ozIMMU in the sample code. |
| ozIMMU_EF_DIR | ../../ozIMMU_EF | Path to the ozIMMU implementation. Used if ozIMMU_EF=yes. |
| cuMpSGEMM | no | Set to yes to use cuMpSGEMM in the sample code. |
| cuMpSGEMM_DIR | ../../cuMpSGEMM | Path to cuMpSGEMM. Used if cuMpSGEMM=yes. |
| HIJACK | no | Set to yes to hijack cuBLAS GEMM calls with emulation in the sample code. |


Example

CUDA build

Build for an NVIDIA H100/H200 GPU (Compute Capability 9.0)

make -j8 BACKEND=cuda CUDA_PATH=/usr/local/cuda GPU_ARCH=90

HIP build

Build for an AMD MI300X GPU (gfx942 architecture)

make -j8 BACKEND=hip HIP_PATH=/opt/rocm GPU_ARCH=gfx942

Running Sample Codes

After building the project, you can run sample codes from the testing directory.

How to Run

Navigate to the testing directory and use make to run tests for different precisions.

make options

| Mode | Value | Description |
| --- | --- | --- |
| test_s | (no) | Tests for SGEMM |
| test_d | (no) | Tests for DGEMM |
| test_c | (no) | Tests for CGEMM |
| test_z | (no) | Tests for ZGEMM |
| MODE | accuracy | Tests numerical accuracy (maximum element-wise relative error). |
| | flops | Measures TFLOPS with square matrices. |
| | watt | Measures watt and GFLOPS/watt with square matrices. |
| | flops_rect | Measures TFLOPS with rectangular matrices. |
| | watt_rect | Measures watt and GFLOPS/watt with rectangular matrices. |
| | all | Runs all available test modes. |

Example

# Run accuracy test (DGEMM)
make test_d MODE="accuracy"
# Run all tests (SGEMM)
make test_s MODE="all"
# Run accuracy, flops, & watt tests (SGEMM & DGEMM)
make test_s test_d MODE="accuracy flops watt"

Usage

This library provides two ways to use the GEMM emulation.

1. Direct Usage (Normal mode)

Call GEMMul8 functions explicitly from your source code. This gives you fine-grained control over the emulation parameters.

Example: run emulation for the CUDA backend

#include "gemmul8.hpp"

// 1. Create a handle to the cuBLAS library context
cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);

// 2. Settings
const unsigned num_moduli = 14u;  // Accuracy knob: 2 <= num_moduli <= 20
const bool fastmode = true;       // true (fast mode) or false (accurate mode)

// 3. Allocate workspace
const size_t worksize = gemmul8::workSize(m, n, k, num_moduli); // required workspace size in bytes
void *work;
cudaMalloc(&work, worksize);

// 4. (Optional) Create a vector to store timing breakdown
std::vector<double> time_breakdown(4, 0.0);

// 5. Run emulation
// The function returns a vector with execution times for each phase.
time_breakdown = gemmul8::gemm(cublas_handle,
                               CUBLAS_OP_N, CUBLAS_OP_N,
                               m, n, k,
                               &alpha, devA, lda,
                               devB, ldb,
                               &beta, devC, ldc,
                               num_moduli, fastmode, work);

// 6. Free workspace
cudaFree(work);

// 7. Destroy a handle
cublasDestroy(cublas_handle);
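
The per-phase timings returned in step 5 can be aggregated as needed. A minimal sketch of computing the total emulation time (assuming <numeric> is included; total_time is an illustrative name):

// Total emulation time in seconds = sum of the per-phase timings
const double total_time = std::accumulate(time_breakdown.begin(), time_breakdown.end(), 0.0);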

Arguments of gemmul8::gemm

The arguments for gemmul8::gemm closely match the standard cublas<t>gemm and hipblasXgemm interfaces, with additional parameters that control internal preprocessing and workspace reuse.

GEMMul8 provides both a workspace query function (workSize) and an extended GEMM computation interface (gemm), as shown below:

namespace gemmul8 {

// workSize returns the required workspace size in bytes.
template <bool is_Complex = false>          // [option] If true, return workspace size for CGEMM/ZGEMM
size_t workSize(
    const size_t m,                         // Number of rows of C
    const size_t n,                         // Number of columns of C
    const size_t k,                         // Inner dimension <= 2^17
    const unsigned num_moduli,              // #moduli, 2 <= num_moduli <= 20
    const bool enable_skip_scalA = false,   // [option] Reserve extra space for A to allow skip_scalA
    const bool enable_skip_scalB = false,   // [option] Reserve extra space for B to allow skip_scalB
    size_t *workSizeA            = nullptr, // [option] Output: workspace size used for A8i and sftA
    size_t *workSizeB            = nullptr  // [option] Output: workspace size used for B8i and sftB
);

// gemm returns the computation time in seconds of each computational phase
#if defined(__NVCC__)
template <typename T> std::vector<double> gemm(
    cublasHandle_t handle,                  // Handle to the cuBLAS library context
    const cublasOperation_t op_A,           // CUBLAS_OP_N or CUBLAS_OP_T
    const cublasOperation_t op_B,           // CUBLAS_OP_N or CUBLAS_OP_T
    const size_t m,                         // Number of rows of C
    const size_t n,                         // Number of columns of C
    const size_t k,                         // Inner dimension <= 2^17
    const T *alpha,                         // Scaling factor for op(A)*op(B)
    const T *const A,                       // 1-D device array of dimensions lda*k (CUBLAS_OP_N) or lda*m (CUBLAS_OP_T)
    const size_t lda,                       // Leading dimension of A
    const T *const B,                       // 1-D device array of dimensions ldb*n (CUBLAS_OP_N) or ldb*k (CUBLAS_OP_T)
    const size_t ldb,                       // Leading dimension of B
    const T *beta,                          // Scaling factor for C
    T *const C,                             // 1-D device array of dimensions ldc*n
    const size_t ldc,                       // Leading dimension of C
    const unsigned num_moduli,              // #moduli, 2 <= num_moduli <= 20
    const bool fastmode,                    // false (accurate mode) or true (fast mode)
    void *const work,                       // Preallocated workspace
    void *const workA            = nullptr, // [optional] Separate workspace for A (if nullptr, uses work)
    void *const workB            = nullptr, // [optional] Separate workspace for B (if nullptr, uses work)
    const bool enable_skip_scalA = false,   // [optional] Enables scaling-skip mechanism for A
    const bool enable_skip_scalB = false,   // [optional] Enables scaling-skip mechanism for B
    bool skip_scalA              = false,   // [optional] If true, skip preprocessing for A
    bool skip_scalB              = false    // [optional] If true, skip preprocessing for B
);
#endif

#if defined(__HIPCC__)
template <typename T> std::vector<double> gemm(
    hipblasHandle_t handle,                 // Handle to the hipBLAS library context
    const hipblasOperation_t op_A,          // HIPBLAS_OP_N or HIPBLAS_OP_T
    const hipblasOperation_t op_B,          // HIPBLAS_OP_N or HIPBLAS_OP_T
    const size_t m,                         // Number of rows of C
    const size_t n,                         // Number of columns of C
    const size_t k,                         // Inner dimension <= 2^17
    const T *alpha,                         // Scaling factor for op(A)*op(B)
    const T *const A,                       // 1-D device array of dimensions lda*k (HIPBLAS_OP_N) or lda*m (HIPBLAS_OP_T)
    const size_t lda,                       // Leading dimension of A
    const T *const B,                       // 1-D device array of dimensions ldb*n (HIPBLAS_OP_N) or ldb*k (HIPBLAS_OP_T)
    const size_t ldb,                       // Leading dimension of B
    const T *beta,                          // Scaling factor for C
    T *const C,                             // 1-D device array of dimensions ldc*n
    const size_t ldc,                       // Leading dimension of C
    const unsigned num_moduli,              // #moduli, 2 <= num_moduli <= 20
    const bool fastmode,                    // false (accurate mode) or true (fast mode)
    void *const work,                       // Preallocated workspace
    void *const workA            = nullptr, // [optional] Separate workspace for A (if nullptr, uses work)
    void *const workB            = nullptr, // [optional] Separate workspace for B (if nullptr, uses work)
    const bool enable_skip_scalA = false,   // [optional] Enables scaling-skip mechanism for A
    const bool enable_skip_scalB = false,   // [optional] Enables scaling-skip mechanism for B
    bool skip_scalA              = false,   // [optional] If true, skip preprocessing for A
    bool skip_scalB              = false    // [optional] If true, skip preprocessing for B
);
#endif

} // namespace gemmul8
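
For reference, a minimal sketch of querying workspace sizes with the optional outputs and the complex-case template parameter (variable names are illustrative; m, n, k, and num_moduli are as above):

size_t sizeA = 0, sizeB = 0;

// Real case (SGEMM/DGEMM): reserve extra space so that scaling of A can be skipped later,
// and also retrieve the portions of the workspace used for A and B
const size_t ws_real = gemmul8::workSize(m, n, k, num_moduli,
                                         /*enable_skip_scalA=*/true,
                                         /*enable_skip_scalB=*/false,
                                         &sizeA, &sizeB);

// Complex case (CGEMM/ZGEMM)
const size_t ws_cplx = gemmul8::workSize<true>(m, n, k, num_moduli);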

Behavior of skip_scalA / skip_scalB

  • gemmul8::gemm internally converts the input matrices A and B into INT8 format and performs modular multiplications across multiple moduli.
  • The conversion (scaling) step can be skipped in consecutive GEMM calls if the same matrices are reused, allowing for substantial performance gains.
  • If enable_skip_scal{A|B} = true, additional workspace is preallocated so that the scaled INT8 representation of A/B can be retained between calls.
  • If both enable_skip_scal{A|B} and skip_scal{A|B} are true, the scaling step for A/B is actually skipped and the previously prepared data are reused for faster execution. This mode assumes that the contents of A/B in device memory remain unchanged.
  • This mechanism is particularly effective when repeatedly multiplying the same A (or B) with different B (or A) matrices.

Note

When using skip_scalA / skip_scalB, the preprocessing step that converts A/B into its internal INT8 representation is skipped. For correctness, the following conditions must all hold between consecutive GEMM calls:

  1. The dimensions (M/K for A, K/N for B) must be identical to those in the previous call.
  2. The operation type (CUBLAS_OP_N / CUBLAS_OP_T) for A/B must be the same as before.
  3. The value of num_moduli must remain unchanged.
  4. The fastmode setting must be identical to that of the previous call.
  5. The contents of A/B in device memory must not be modified between calls.

Caution

If any of these conditions differ, the cached scaled data become invalid, and skipping must not be used. In such cases, set skip_scalA=false / skip_scalB=false.

Caution

This skip mechanism is designed for repeated GEMM calls with identical A or B. Use it only when you are certain that the input matrices and configuration have not changed. When in doubt, disable skipping to ensure correctness.

Example: How to skip scaling step

#include "gemmul8.hpp"

// 1. Create a handle to the cuBLAS library context
cublasHandle_t cublas_handle;
cublasCreate(&cublas_handle);

// 2. Settings
const unsigned num_moduli = 14u;
const bool fastmode = false;

// 3. Matrix shapes
const size_t m1 = 64, n1 = 10, k1 = 10; // 1st GEMM: 64×10 × 10×10
const size_t m2 = 20, n2 = 10, k2 = 10; // 2nd GEMM: 20×10 × 10×10

// 4. Allocate workspace
size_t worksizeA, worksizeB;
const size_t worksize = gemmul8::workSize(
    std::max(m1, m2), std::max(n1, n2), std::max(k1, k2),
    num_moduli, false, true, &worksizeA, &worksizeB);

void *work;
cudaMalloc(&work, worksize);

int8_t *workA = reinterpret_cast<int8_t *>(work); // fixed workspace for A (worksizeA bytes)
int8_t *workB = workA + worksizeA;                // fixed workspace for B (worksizeB bytes)
int8_t *work_rem = workB + worksizeB;             // remaining workspace

// 5. Run GEMM (first call: scaling performed)
gemmul8::gemm(cublas_handle,
              CUBLAS_OP_N, CUBLAS_OP_N,
              m1, n1, k1,
              &alpha, devA1, lda,
              devB, ldb,
              &beta, devC, ldc,
              num_moduli, fastmode, (void*)work_rem, (void*)workA, (void*)workB,
              false, true, false, false);

// 6. Second GEMM: a different A (devA2, 20×10) with the same B; the scaled B is reused (skip_scalB = true)
//    B must remain unchanged in device memory between the two calls.
gemmul8::gemm(cublas_handle,
              CUBLAS_OP_N, CUBLAS_OP_N,
              m2, n2, k2,
              &alpha, devA2, lda,
              devB, ldb,
              &beta, devC, ldc,
              num_moduli, fastmode, (void*)work_rem, (void*)workA, (void*)workB,
              false, true, false, true);

// 7. Free workspace
cudaFree(work);

// 8. Destroy a handle
cublasDestroy(cublas_handle);

2. Hijack cuBLAS/hipBLAS GEMM (Hook Mode)

Intercept standard GEMM calls automatically without modifying the application source code.

Interception target

  • cublasSgemm, cublasDgemm, cublasCgemm, cublasZgemm
  • cublasSgemm_v2, cublasDgemm_v2, cublasCgemm_v2, cublasZgemm_v2
  • cublasGemmEx
  • hipblasSgemm, hipblasDgemm, hipblasCgemm, hipblasZgemm
  • hipblasGemmEx, hipblasGemmEx_v2

How to enable the hook

  1. Build the library.
  2. Set the LD_PRELOAD environment variable.
export LD_PRELOAD=/path-to-GEMMul8/lib/libgemmul8.so
  3. Run your application.

Configure emulation parameters via environment variables

export GEMMUL8_NUM_MOD_D=15
export GEMMUL8_NUM_MOD_S=7
export GEMMUL8_NUM_MOD_Z=15
export GEMMUL8_NUM_MOD_C=7
export GEMMUL8_FASTMODE_D=1
export GEMMUL8_FASTMODE_S=0
export GEMMUL8_FASTMODE_Z=1
export GEMMUL8_FASTMODE_C=0
export GEMMUL8_MAX_M=4096
export GEMMUL8_MAX_N=4096
export GEMMUL8_MAX_K=4096
export GEMMUL8_MAX_NUM_MOD=18
export GEMMUL8_SKIP_SCALE_A=1
export GEMMUL8_SKIP_SCALE_B=1

| Variable | Default | Applies to | Description |
| --- | --- | --- | --- |
| GEMMUL8_NUM_MOD_D | 0 | DGEMM | Number of moduli (unsigned num_moduli) used in DGEMM emulation. Controls accuracy. |
| GEMMUL8_NUM_MOD_S | 0 | SGEMM | Number of moduli (unsigned num_moduli) used in SGEMM emulation. Controls accuracy. |
| GEMMUL8_NUM_MOD_Z | 0 | ZGEMM | Number of moduli (unsigned num_moduli) used in ZGEMM emulation. Controls accuracy. |
| GEMMUL8_NUM_MOD_C | 0 | CGEMM | Number of moduli (unsigned num_moduli) used in CGEMM emulation. Controls accuracy. |
| GEMMUL8_FASTMODE_D | 0 | DGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| GEMMUL8_FASTMODE_S | 0 | SGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| GEMMUL8_FASTMODE_Z | 0 | ZGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| GEMMUL8_FASTMODE_C | 0 | CGEMM | Enables fast mode (1 = fast mode, 0 = accurate mode). |
| GEMMUL8_MAX_M | 0 | all | Maximum value of M used to preallocate workspace memory. |
| GEMMUL8_MAX_N | 0 | all | Maximum value of N used to preallocate workspace memory. |
| GEMMUL8_MAX_K | 0 | all | Maximum value of K used to preallocate workspace memory. |
| GEMMUL8_MAX_NUM_MOD | 2 | all | Maximum number of moduli used when computing the size of the preallocated workspace. |
| GEMMUL8_SKIP_SCALE_A | 0 | all | Enables skipping redundant preprocessing for A (1 = enable, 0 = disable). |
| GEMMUL8_SKIP_SCALE_B | 0 | all | Enables skipping redundant preprocessing for B (1 = enable, 0 = disable). |
  • If GEMMUL8_NUM_MOD_{S|D|Z|C} < 2 || 20 < GEMMUL8_NUM_MOD_{S|D|Z|C}, execute the corresponding native gemm routine.
  • This hook mode preallocates a single large GPU workspace on demand and resizes as needed.
  • Each cublasHandle_t maintains an independent workspace.
  • The workspace is automatically released when the corresponding handle is destroyed via cublasDestroy() or cublasDestroy_v2().
  • Workspace size is determined as max(wsmax, ws, pre_ws) (see the sketch after this list), where
    • wsmax = workSize(GEMMUL8_MAX_M, GEMMUL8_MAX_N, GEMMUL8_MAX_K, GEMMUL8_MAX_NUM_MOD)
    • ws = workSize(m, n, k, GEMMUL8_NUM_MOD_{D|S|Z|C})
    • pre_ws = the size of the previously allocated workspace for the same cublasHandle_t
  • If GEMMUL8_MAX_{M|N|K|NUM_MOD} variables are set appropriately, the workspace size becomes fixed to wsmax, avoiding costly reallocations during subsequent GEMM calls.
  • If GEMMUL8_NUM_MOD_Z > 0 or GEMMUL8_NUM_MOD_C > 0, wsmax is workspace size for complex cases.
  • The workspace is never shrunk automatically; it only grows when a larger allocation is required.
  • When GEMMUL8_SKIP_SCALE_{A|B}=1, redundant preprocessing for A/B is skipped (see below).
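
A minimal sketch of the sizing rule above (illustrative only, not the hook's actual code; max_m, max_n, max_k, and max_num_mod stand for the GEMMUL8_MAX_{M|N|K|NUM_MOD} settings, pre_ws for the previously allocated size, and <algorithm> is assumed for std::max):

const size_t wsmax = gemmul8::workSize(max_m, max_n, max_k, max_num_mod); // upper bound from GEMMUL8_MAX_* variables
const size_t ws    = gemmul8::workSize(m, n, k, num_moduli);              // requirement of the current GEMM call
const size_t new_size = std::max({wsmax, ws, pre_ws});                    // workspace size used for this handle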

Important

  • GEMMUL8_SKIP_SCALE_{A|B}=1 allows consecutive GEMM calls using the same matrix pointers (A/B) to reuse already-scaled intermediate data.
  • Automatic skip of conversion from A or B to INT8 is enabled when:
    1. GEMMUL8_SKIP_SCALE_{A|B}=1.
    2. The same computation mode (fast mode or accurate mode) is used for both the previous and current calls.
    3. The same cublasHandle_t is used for both the previous and current calls.
    4. The same device pointer (A or B) is used for both the previous and current calls.
    5. The same matrix shapes (M/K/lda for A, K/N/ldb for B) are used for both the previous and current calls.
    6. The same operation flag (transa for A, transb for B) is used for both the previous and current calls.
    7. The same number of moduli (GEMMUL8_NUM_MOD_{D|S|Z|C}) is used for both the previous and current calls.
    8. The same workspace size (ws) is used for both the previous and current calls.

Tip

To ensure the last condition is met, it is recommended to set the GEMMUL8_MAX_{M|N|K|NUM_MOD} variables. This fixes the workspace size to wsmax, preventing it from changing between calls.
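
For example, one way to pin the workspace from within the application before the first intercepted GEMM call (the bounds below are illustrative; choose values that cover your largest GEMM):

setenv("GEMMUL8_MAX_M", "8192", 1);
setenv("GEMMUL8_MAX_N", "8192", 1);
setenv("GEMMUL8_MAX_K", "8192", 1);
setenv("GEMMUL8_MAX_NUM_MOD", "20", 1);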

Caution

Skip scaling assumes that the matrices A or B remain unchanged in GPU memory.
If their contents are modified between GEMM calls, set GEMMUL8_SKIP_SCALE_{A|B}=0 to ensure correctness.

How to change environment variables programmatically

You can also set these environment variables programmatically from within your code using setenv.

#include <cstdio>  // snprintf
#include <cstdlib> // setenv

char num_moduli[12];

// Run emulation with num_moduli = 15 & fastmode = true
snprintf(num_moduli, sizeof(num_moduli), "%u", 15u);
setenv("GEMMUL8_NUM_MOD_D", num_moduli, 1);
setenv("GEMMUL8_FASTMODE_D", "1", 1);
cublasDgemm_v2(...);

// Run emulation with num_moduli = 18 & fastmode = false
snprintf(num_moduli, sizeof(num_moduli), "%u", 18u);
setenv("GEMMUL8_NUM_MOD_D", num_moduli, 1);
setenv("GEMMUL8_FASTMODE_D", "0", 1);
cublasDgemm_v2(...);

Numerical results

See numerical results in the separate repository: GEMMul8_numerical_results

Acknowledgment

Caution

Please do not contact the individuals listed below regarding this code.

Assistance with debugging

  • Patrick Gutsche (École Normale Supérieure de Lyon, France)
  • Prajval Kumar (Indian Institute of Science Education and Research, India)
  • Dr. William Dawson (RIKEN Center for Computational Science, Japan)
  • Dr. Toshiyuki Imamura (RIKEN Center for Computational Science, Japan)

(Affiliations as of 2025)

Assistance with experiments on a B200 GPU

  • Dr. Qianxiang Ma (RIKEN Center for Computational Science, Japan)
  • Prof. Rio Yokota (Institute of Science Tokyo, Japan)

(Affiliations as of 2025)

Contact (Responsible Developer)

  • Yuki Uchino (RIKEN Center for Computational Science, Japan)
  • yuki.uchino.fe (at) riken.jp

References

  • Ootomo, H., & Yokota, R. (2022). Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance. The International Journal of High Performance Computing Applications, 36(4), 475-491, doi.org/10.1177/10943420221090256.
  • Ootomo, H., Manabe, H., Harada, K., & Yokota, R. (2023). Quantum Circuit Simulation by SGEMM Emulation on Tensor Cores and Automatic Precision Selection. In High Performance Computing (pp. 259-276). Springer, doi.org/10.1007/978-3-031-32041-5_14.
  • Ootomo, H., Ozaki, K., & Yokota, R. (2024). DGEMM on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 38(4), 297-313, https://doi.org/10.1177/10943420241239588.
  • Uchino, Y., Ozaki, K., & Imamura, T. (2025). Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, 39(3), 462-476, doi.org/10.1177/10943420241313064.

Citations

@inproceedings{10.1145/3731599.3767539,
    author = {Uchino, Yuki and Ozaki, Katsuhisa and Imamura, Toshiyuki},
    title = {High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines},
    year = {2025},
    isbn = {9798400718717},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3731599.3767539},
    doi = {10.1145/3731599.3767539},
    booktitle = {Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis},
    pages = {1824-1831},
    numpages = {8},
    series = {SC Workshops '25}
}

and

@misc{ozaki2025ozakischemeiigemmoriented,
      title={Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique},
      author={Katsuhisa Ozaki and Yuki Uchino and Toshiyuki Imamura},
      year={2025},
      eprint={2504.08009},
      archivePrefix={arXiv},
      primaryClass={cs.MS},
      url={https://arxiv.org/abs/2504.08009},
}

License

This project is licensed under the MIT License. See the LICENSE file for details.
