demonbibi

3 followers · 5 following

Stars

deepseek-ai / TileKernels

A kernel library written in tilelang

Python 1,597 142 Updated Apr 23, 2026

KernelFlow-ops / cuda-optimized-skill

A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflo…

Python 180 17 Updated Apr 22, 2026

Dao-AILab / sonic-moe

Accelerating MoE with IO and Tile-aware Optimizations

Python 719 90 Updated Jun 15, 2026

Dao-AILab / quack

A Quirky Assortment of CuTe Kernels

Python 1,026 136 Updated Jun 20, 2026

OthmanAdi / planning-with-files

Persistent file-based planning for AI coding agents and long-running agentic tasks. Crash-proof markdown plans that survive context loss and /clear, plus a deterministic completion gate and multi-a…

Python 23,754 2,074 Updated Jun 16, 2026

obra / superpowers

An agentic skills framework & software development methodology that works.

Shell 235,760 20,926 Updated Jun 22, 2026

karpathy / autoresearch

AI agents running research on single-GPU nanochat training automatically

Python 88,084 12,754 Updated Mar 26, 2026

xgbj / sparse-mask-attention

High-performance sparse-mask attention CUDA operator for short sequences, optimized for sequences under 1K with 75% sparsity

Python 79 8 Updated Mar 18, 2026

deepreinforce-ai / CUDA-L2

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Cuda 441 28 Updated Mar 30, 2026

InternLM / xtuner

A Next-Generation Training Engine Built for Ultra-Large MoE Models

Python 5,150 425 Updated Jun 22, 2026

ScalingIntelligence / KernelBench

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Jupyter Notebook 1,075 174 Updated Mar 24, 2026

thunlp / TritonBench

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Python 134 14 Updated Jun 14, 2025

Tencent / hpc-ops

High Performance LLM Inference Operator Library

C++ 956 97 Updated Jun 11, 2026

MARD1NO / CUDA-PPT

136 29 Updated Apr 16, 2026

reed-lau / cute-gemm

C++ 183 45 Updated May 11, 2026

CalebDu / Awesome-Cute

C++ 122 19 Updated May 16, 2025

fattorib / fusedswiglu

Fused SwiGLU Triton kernels

Python 13 4 Updated Jan 25, 2024

DefTruth / CUDA-Learn-Notes

Forked from xlite-dev/LeetCUDA

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 83 8 Updated Apr 26, 2025

Starmys / TritonStudyGroup

Python 130 11 Updated Sep 22, 2025

ArthurinRUC / cutlass-notes

From Minimal GEMM to Everything

Python 221 12 Updated Jun 8, 2026

alexzhang13 / flashattention2-custom-mask

Triton implementation of FlashAttention2 that adds Custom Masks.

Python 176 16 Updated Aug 14, 2024

ROCm / composable_kernel

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

C++ 537 300 Updated Jun 22, 2026

linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training

Python 6,448 543 Updated Jun 17, 2026

zinccat / Awesome-Triton-Kernels

Collection of kernels written in Triton language

199 10 Updated Jan 27, 2026

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,400 1,059 Updated Jun 4, 2026

deepseek-ai / DeepEP

DeepEP: an efficient expert-parallel communication library

Cuda 9,751 1,293 Updated Jun 15, 2026

deepseek-ai / DeepSeek-V3

Python 103,783 16,733 Updated Aug 28, 2025

meta-recsys / generative-recommenders

Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).

Python 1,928 394 Updated Jun 18, 2026

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 38,924 17,572 Updated Jun 22, 2026

BBuf / tvm_mlir_learn

A collection of compiler learning resources.

Python 2,749 370 Updated May 20, 2026