Fast inference engine for Transformer models (C++, updated Nov 29, 2025)
Tuned OpenCL BLAS
Serial and parallel implementations of matrix multiplication
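A minimal sketch of what serial and parallel matrix-multiply implementations typically look like (this is an illustrative example, not code from the listed repository): a triple-loop serial kernel in i-k-j order for locality, and a parallel variant that splits the rows of C across `std::thread` workers.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Serial triple-loop multiply: C += A * B, row-major, all n x n.
// The i-k-j loop order streams rows of B, which is friendlier to the cache
// than the textbook i-j-k order.
void matmul_serial(const std::vector<float>& A, const std::vector<float>& B,
                   std::vector<float>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k) {
            float a = A[i * n + k];
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += a * B[k * n + j];
        }
}

// Parallel variant: each thread computes a strided subset of C's rows,
// so no two threads ever write the same output element.
void matmul_parallel(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, std::size_t n) {
    unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t)
        pool.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads)
                for (std::size_t k = 0; k < n; ++k) {
                    float a = A[i * n + k];
                    for (std::size_t j = 0; j < n; ++j)
                        C[i * n + j] += a * B[k * n + j];
                }
        });
    for (auto& th : pool) th.join();
}
```

Row partitioning is the simplest safe decomposition; real libraries instead partition into 2-D tiles so each thread's working set fits in its private cache.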
Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceleration.
Multiple GEMM operators built with CUTLASS to support LLM inference.

DGEMM on KNL, achieving 75% of MKL's performance
My GEMM optimization on a Raspberry Pi (ARM) achieved a 170x speedup, running faster than Eigen and close to OpenBLAS.
Manually optimizing the GEMM (GEneral Matrix Multiply) operation; there is still a long way to go.
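One of the first steps in manual GEMM optimization is cache blocking. The sketch below (an illustrative example, not the repository's code; the tile size `BS` is a made-up default) restructures the triple loop into tiles so that each tile of B is reused from cache many times before being evicted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tile edge length; in practice this is tuned so a BS x BS tile of each
// operand fits comfortably in L1/L2 cache.
constexpr std::size_t BS = 32;

// Cache-blocked GEMM: C += A * B, row-major n x n matrices.
// std::min guards handle edge tiles when n is not a multiple of BS.
void gemm_blocked(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += BS)
        for (std::size_t kk = 0; kk < n; kk += BS)
            for (std::size_t jj = 0; jj < n; jj += BS)
                // Micro-kernel over one tile; i-k-j order streams rows of B.
                for (std::size_t i = ii; i < std::min(ii + BS, n); ++i)
                    for (std::size_t k = kk; k < std::min(kk + BS, n); ++k) {
                        double a = A[i * n + k];
                        for (std::size_t j = jj; j < std::min(jj + BS, n); ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Blocking alone typically closes only part of the gap to vendor BLAS; the remaining distance comes from register tiling, SIMD micro-kernels, and operand packing.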
CUDA GEMM convolution implementation
mixed-precision GEMM library
Development of deep learning inference code using OpenCL kernel functions.
Low Precision Arithmetic for Convolutional Neural Network Inference
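The core idea behind low-precision inference is to quantize activations and weights to int8, accumulate products in int32 to avoid overflow, and rescale the result back to floating point. A minimal sketch of that pattern (illustrative only; the function name and the symmetric per-tensor scaling scheme are assumptions, not taken from the listed project):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// int8 dot product with int32 accumulation, the inner kernel of a
// low-precision GEMM. Inputs are symmetrically quantized: real_value
// is approximately int8_value * scale. Each int8 * int8 product fits in
// int16, and summing them in int32 avoids overflow for realistic lengths.
float quantized_dot(const std::vector<int8_t>& a, const std::vector<int8_t>& b,
                    float scale_a, float scale_b) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += int32_t(a[i]) * int32_t(b[i]);
    // Dequantize the accumulator: one multiply by the product of scales.
    return float(acc) * scale_a * scale_b;
}
```

Hardware int8 paths (e.g. dot-product instructions on recent CPUs and GPUs) implement exactly this multiply-accumulate shape, which is why quantized GEMM can be several times faster than fp32.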
WMMA GEMM in ROCm for RDNA GPUs
My experiments with convolution
Yet Another Machine Inference framework
OpenMP Matrix Multiplication Offloading Playground