gemm
Here are 93 public repositories matching this topic...
My attempt of making a GEMM kernel... (Updated Apr 21, 2025, CUDA)
The fastest Tropical number matrix multiplication on GPU (Updated Aug 23, 2025, Julia)
Development of deep learning inference code using OpenCL kernel functions. (Updated Jun 1, 2022, C++)
My GEMM optimization on RPi (ARM) achieved a 170x performance boost, running faster than Eigen and close to OpenBLAS. (Updated Nov 17, 2024, C++)
Low Precision Arithmetic for Convolutional Neural Network Inference (Updated Oct 29, 2017, C++)
Performance comparison of naive, AVX2-optimized, and cBLAS matrix multiplication implementations in C. (Updated Nov 10, 2025, C)
Safety-hardened GEMM (matrix multiply) implementation achieving 169.8 GFLOPS on Intel i9-14900. Built for embedded systems and safety-critical applications where reliability matters as much as speed. 162x faster than naive, zero UB, fully validated. (Updated Nov 21, 2025, C)
Repo for the SPOGA Accelerator: Scaling Analog Photonic Accelerators for Byte-Size, Integer General Matrix Multiply (GEMM) Kernels (Updated Apr 3, 2025, Python)
Mixed-precision GEMM library (Updated May 27, 2025, C++)