AntGroup - Hangzhou (UTC +08:00) - http://wangfakang.github.io
Stars
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Helpful kernel tutorials and examples for tile-based GPU programming
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
How to optimize algorithms in CUDA.
AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn
Personal paper-reading notes covering machine learning systems, AI infrastructure, and other interesting topics.
Distributed Compiler based on Triton for Parallel Systems
An open-source AI agent that lives in your terminal.
Accelerating MoE with IO and Tile-aware Optimizations
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
High-performance distributed data shuffling (all-to-all) library for MoE training and inference (a minimal dispatch sketch appears after this list)
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Venus Collective Communication Library, supported by SII and Infrawaves.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
A Visual Studio Code extension for building and debugging CUDA applications.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
CUDA Python: Performance meets Productivity
FlagCX is a scalable and adaptive cross-chip communication library.
Pipeline Parallelism Emulation and Visualization (a bubble-fraction sketch appears after this list)
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
Official implementation of the paper "SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
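
The all-to-all (dispatch) pattern behind the MoE data-shuffling entry above can be illustrated without a GPU or a real communication library. Below is a minimal single-process sketch in plain Python; the rank layout, function names, and routing rule are illustrative assumptions, not any library's API. Each rank buckets its tokens by the rank that owns the routed expert, and the all-to-all exchange leaves every rank holding exactly the tokens its local experts must process.

    # Minimal single-process simulation of MoE all-to-all token dispatch.
    # All names and the routing rule here are illustrative assumptions.

    def all_to_all(buckets):
        """buckets[src][dst] -> payload; returns received[dst][src]."""
        n = len(buckets)
        return [[buckets[src][dst] for src in range(n)] for dst in range(n)]

    def dispatch(tokens_per_rank, expert_of, experts_per_rank):
        """Bucket each rank's tokens by the rank owning the routed expert,
        then exchange buckets so every rank holds its experts' tokens."""
        n = len(tokens_per_rank)
        buckets = [[[] for _ in range(n)] for _ in range(n)]
        for src in range(n):
            for tok in tokens_per_rank[src]:
                dst = expert_of(tok) // experts_per_rank  # owning rank
                buckets[src][dst].append(tok)
        return all_to_all(buckets)

    if __name__ == "__main__":
        # 2 ranks, 2 experts per rank; route token t to expert t % 4.
        received = dispatch([[0, 1, 2, 3], [4, 5, 6, 7]],
                            lambda t: t % 4, experts_per_rank=2)
        print(received[0])  # [[0, 1], [4, 5]] -> tokens for experts 0-1
        print(received[1])  # [[2, 3], [6, 7]] -> tokens for experts 2-3

A real implementation exchanges bucket sizes first (so receive buffers can be sized) and runs a symmetric combine step after the experts compute; this sketch shows only the routing logic.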
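
The key quantity a pipeline-parallelism emulator visualizes can also be stated in a few lines. Assuming the standard GPipe-style schedule with p stages, m microbatches, and uniform per-stage compute (an idealization), the idle "bubble" fraction is (p - 1) / (m + p - 1). The sketch below simply evaluates that formula; it is not the tool's code.

    def gpipe_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
        """Idle-time fraction of an idealized GPipe schedule.
        Each stage occupies num_microbatches + num_stages - 1 time slots,
        of which num_stages - 1 are pipeline fill/drain bubble."""
        p, m = num_stages, num_microbatches
        return (p - 1) / (m + p - 1)

    if __name__ == "__main__":
        for m in (4, 8, 32):
            frac = gpipe_bubble_fraction(4, m)
            print(f"4 stages, {m:2d} microbatches -> bubble {frac:.1%}")
        # Prints 42.9%, 27.3%, 8.6%: more microbatches shrink the bubble.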