Stars
🌐 3D and 4D World Modeling: A Survey
ICCV 2023: QD-BEV: Quantization-aware View-guided Distillation for Multi-view 3D Object Detection
Official implementation of MAD: Motion Appearance Decoupling for efficient Driving World Models.
[NeurIPS 2025] Official code of Unifying Appearance Codes and Bilateral Grids for Driving Scene Gaussian Splatting
[CVPR 2026 Oral] Learning to Drive via Real-World Simulation at Scale
awesome-autonomous-driving
⛽️ "Algorithm Pass Handbook": a from-scratch tutorial on algorithms and data structures, with 200 popular algorithm interview questions and 1000+ LeetCode problem solutions; continuously updated!
🔥 LeetCode solutions in any programming language | Solutions to LeetCode, "Sword Pointing to Offer (2nd Edition)", and "Cracking the Coding Interview (6th Edition)" in multiple programming languages
A nano Flash Attention implementation in the pure CUTLASS CuTe DSL
A CUTLASS CuTe implementation of a head-dim-64 FlashAttention-2 TensorRT plugin for LightGlue. Runs on a Jetson Orin NX 8GB with TensorRT 8.5.2.
[ICLR 2025, ICML 2025, NeurIPS 2025 Spotlight] Quantized Attention achieves a 2-5x speedup over FlashAttention without losing end-to-end metrics across language, image, and video models.
[Information Fusion 2025] A Survey on Occupancy Perception for Autonomous Driving: The Information Fusion Perspective
[CVPR 2026 Highlight] LitePT: Lighter Yet Stronger Point Transformer
Flash Attention from Scratch on CUDA Ampere
My tests and experiments with some popular deep-learning frameworks.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
A Python subset for a better MLIR programming experience
A sandbox for quick iteration and experimentation on projects related to IREE, MLIR, and LLVM
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
Compile MLIR to PTX and execute it on NVIDIA GPUs
State of the art sorting and segmented sorting, including OneSweep. Implemented in CUDA, D3D12, and Unity style compute shaders. Theoretically portable to all wave/warp/subgroup sizes.
How to optimize common algorithms in CUDA.