
Autonomous GPU kernel optimization system driven by AI agents.

Python 29 Updated Mar 19, 2026

A PyTorch native platform for training generative AI models

Python 5,175 756 Updated Mar 23, 2026

If you want to purchase Panzhihua Mi Yi Pipa, please contact me.

11 1 Updated Mar 16, 2026

Automated CUDA kernel performance diagnostics from NVIDIA Nsight Compute (NCU) CSV exports.

Rust 25 Updated Mar 18, 2026

Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.

Python 780 65 Updated Mar 19, 2026

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python 1,753 165 Updated Mar 23, 2026

Terminal UI for NVIDIA Nsight Systems profiles — timeline viewer, kernel navigator, NVTX hierarchy

Python 45 8 Updated Mar 23, 2026

A Chinese-localized version of Humanizer, a set of Claude Code Skills aimed at removing traces of AI generation from text.

5,106 428 Updated Jan 19, 2026

From Minimal GEMM to Everything

Cuda 188 10 Updated Feb 10, 2026

The ultimate space for work and life — to find, build, and collaborate with agent teammates that grow with you. We are taking agent harness to the next level — enabling multi-agent collaboration, e…

TypeScript 74,172 14,825 Updated Mar 23, 2026

An agentic skills framework & software development methodology that works.

Shell 107,425 8,621 Updated Mar 19, 2026

FlashInfer: Kernel Library for LLM Serving

Python 5,205 821 Updated Mar 23, 2026

"A Practical Guide to Open-Source LLMs": tutorials, tailored for beginners in China, on quickly fine-tuning (full-parameter/LoRA) and deploying Chinese and international open-source large language models (LLMs) and multimodal large models (MLLMs) in a Linux environment.

Jupyter Notebook 29,213 2,873 Updated Mar 22, 2026

FlashInfer Bench @ MLSys 2026: Building AI agents to write high performance GPU kernels

Python 153 109 Updated Mar 20, 2026

High Performance LLM Inference Operator Library

C++ 793 74 Updated Feb 5, 2026

High-performance RMSNorm implementation using SM on-chip storage (registers and shared memory).

Cuda 30 1 Updated Jan 22, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 613 66 Updated Mar 23, 2026

A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.

Python 3,775 517 Updated Mar 13, 2026

A size profiler for CUDA binaries.

Python 71 Updated Jan 15, 2026

Learning materials for NVIDIA cuTile.

Python 165 2 Updated Dec 9, 2025

A PyTorch-native inference engine with hybrid cache acceleration and massive parallelism for DiTs.

Python 1,104 66 Updated Mar 23, 2026

GPU programming related news and material links

2,060 120 Updated Mar 8, 2026

GPU documentation for humans

Python 544 66 Updated Jan 27, 2026

Train speculative decoding models effortlessly and port them smoothly to SGLang serving.

Python 741 186 Updated Mar 21, 2026

Expert Specialization MoE Solution based on CUTLASS

Cuda 27 2 Updated Jan 19, 2026

A Quirky Assortment of CuTe Kernels

Python 863 98 Updated Mar 23, 2026

Utility scripts for PyTorch (e.g. making Perfetto show kernels that would otherwise disappear, a memory profiler that understands lower-level allocations such as NCCL's, ...).

Python 94 7 Updated Sep 11, 2025

🌈 Solutions of LeetGPU

Cuda 76 11 Updated Mar 3, 2026

Nano vLLM

Python 12,390 1,772 Updated Nov 3, 2025

LightLLM is a Python-based inference and serving framework for large language models (LLMs), notable for its lightweight design, easy scalability, and high performance.

Python 3,962 311 Updated Mar 23, 2026