This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, s…

Cuda 1,184 172 Updated Jul 29, 2023

sogou / workflow

C++ Parallel Computing and Asynchronous Networking Framework

C++ 14,158 2,551 Updated Nov 3, 2025

willhua / QualcommOpenCLSDKNote

The note of Qualcomm OpenCL SDK

C++ 36 9 Updated Nov 8, 2018

pigirons / sgemm_hsw

This is an implementation of sgemm_kernel on L1d cache.

Assembly 230 33 Updated Feb 26, 2024

wzh404 / cpu-cache-test

CPU cache latency experiment

C 1 Updated Jan 21, 2022

jcxz / OpenCL-correlation-using-local-memory

Correlation demo in OpenCL that uses local memory.

C 1 Updated Feb 24, 2015

ihaque / memtestCL

OpenCL memory tester for GPUs

C++ 144 26 Updated Jan 23, 2021

Tencent / libpag

The official rendering library for PAG (Portable Animated Graphics) files that renders After Effects animations natively across multiple platforms.

C++ 5,496 491 Updated Nov 6, 2025

vetter / shoc

The SHOC Benchmark Suite

Makefile 257 105 Updated Oct 6, 2025

782132930 / FFT

45 17 Updated Dec 18, 2020

OpenPPL / ppl.nn

A primitive library for neural network

C++ 1,367 222 Updated Nov 24, 2024

BBuf / ArmNeonOptimization

arm-neon

C++ 92 23 Updated Aug 2, 2024

Cjkkkk / CUDA_gemm

A simple high performance CUDA GEMM implementation.

Cuda 415 42 Updated Jan 4, 2024

yawen_Li a243845305

Stars

tile-ai / tilelang

apache / tvm

triton-lang / triton

FlagTree / flagtree

FlagOpen / FlagGems

KEKE046 / mlir-tutorial

llvm / llvm-project

BBuf / tvm_mlir_learn

Yinghan-Li / YHs_Sample

nicolaswilde / cuda-sgemm

GinsengHoney / CUDA_Study

longv2go / NVIDIA-OpenCL-Samples

Li-Mingshuang / iphone_dcim_backup

OpenPPL / ppq

BBuf / how-to-optimize-gemm

MegEngine / mperf

MegEngine / MegPeak

Liu-xiandong / How_to_optimize_in_GPU