A kernel library written in tilelang

Python 1,597 142 Updated Apr 23, 2026

A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflo…

Python 180 17 Updated Apr 22, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 719 90 Updated Jun 15, 2026

A Quirky Assortment of CuTe Kernels

Python 1,026 136 Updated Jun 20, 2026

Persistent file-based planning for AI coding agents and long-running agentic tasks. Crash-proof markdown plans that survive context loss and /clear, plus a deterministic completion gate and multi-a…

Python 23,754 2,074 Updated Jun 16, 2026

An agentic skills framework & software development methodology that works.

Shell 235,760 20,926 Updated Jun 22, 2026

AI agents running research on single-GPU nanochat training automatically

Python 88,084 12,754 Updated Mar 26, 2026

High-performance short-sequence sparse-mask attention CUDA operator, optimized for sequences under 1K with 75% sparsity

Python 79 8 Updated Mar 18, 2026

CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning

Cuda 441 28 Updated Mar 30, 2026

A Next-Generation Training Engine Built for Ultra-Large MoE Models

Python 5,150 425 Updated Jun 22, 2026

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Jupyter Notebook 1,075 174 Updated Mar 24, 2026

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Python 134 14 Updated Jun 14, 2025

High Performance LLM Inference Operator Library

C++ 956 97 Updated Jun 11, 2026
C++ 183 45 Updated May 11, 2026
C++ 122 19 Updated May 16, 2025

Fused SwiGLU Triton kernels

Python 13 4 Updated Jan 25, 2024

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda 83 8 Updated Apr 26, 2025
Python 130 11 Updated Sep 22, 2025

From Minimal GEMM to Everything

Python 221 12 Updated Jun 8, 2026

Triton implementation of FlashAttention2 that adds Custom Masks.

Python 176 16 Updated Aug 14, 2024

[DEPRECATED] Moved to ROCm/rocm-libraries repo. NOTE: develop branch is maintained as a read-only mirror

C++ 537 300 Updated Jun 22, 2026

Efficient Triton Kernels for LLM Training

Python 6,448 543 Updated Jun 17, 2026

Collection of kernels written in Triton language

199 10 Updated Jan 27, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,400 1,059 Updated Jun 4, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,751 1,293 Updated Jun 15, 2026

Repository hosting code for "Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations" (https://arxiv.org/abs/2402.17152).

Python 1,928 394 Updated Jun 18, 2026

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 38,924 17,572 Updated Jun 22, 2026

A collection of compiler learning resources.

Python 2,749 370 Updated May 20, 2026