wangfakang

sky wangfakang

241 followers · 77 following

AntGroup
HangZhou
http://wangfakang.github.io

Stars

pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

C++ 2,771 570 Updated Dec 18, 2025

yangwhale / gpu-tpu-pedia

Python 8 Updated Apr 3, 2026

thedotmack / claude-mem

A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future …

TypeScript 61,281 5,078 Updated Apr 17, 2026

THUDM / IndexCache

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

85 8 Updated Mar 14, 2026

NVIDIA / TileGym

Helpful kernel tutorials and examples for tile-based GPU programming

Python 702 65 Updated Apr 17, 2026

antgroup / DeepXTrace

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 95 7 Updated Jan 16, 2026

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda 2,928 270 Updated Apr 16, 2026

datawhalechina / vibe-vibe

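AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn; The first AI lesson for everyone, the first systematic open-source Vibe Coding tutorial | From zero experience to full-stack practice, so that everyone can use AI to realize their own ideas and…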

Dockerfile 4,603 374 Updated Mar 18, 2026

mental2008 / awesome-papers

Here are my personal paper reading notes (including machine learning systems, AI infrastructure, and other interesting topics).

JavaScript 180 10 Updated Apr 12, 2026

ByteDance-Seed / Triton-distributed

Distributed Compiler based on Triton for Parallel Systems

Python 1,408 138 Updated Apr 10, 2026

QwenLM / qwen-code

An open-source AI agent that lives in your terminal.

TypeScript 23,450 2,223 Updated Apr 17, 2026

Dao-AILab / sonic-moe

Accelerating MoE with IO and Tile-aware Optimizations

Python 635 73 Updated Apr 15, 2026

aikitoria / nanotrace

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

C 136 6 Updated Nov 26, 2025

infinigence / FUSCO

High-performance distributed data shuffling (all-to-all) library for MoE training and inference

Python 118 11 Updated Mar 7, 2026

nex-agi / NexVenusCL

Nex Venus Communication Library

C++ 74 7 Updated Nov 17, 2025

NVIDIA / cutile-python

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 2,023 131 Updated Apr 17, 2026

NVIDIA / cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,595 1,796 Updated Apr 17, 2026

sgl-project / sgl-kernel-npu

SGLang kernel library for NPU

C++ 125 111 Updated Apr 16, 2026

tile-ai / tilelang

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

Python 5,497 508 Updated Apr 17, 2026

cchan / tccl

Extensible collectives library in Triton

Python 98 6 Updated Mar 31, 2025

sii-research / VCCL

Venus Collective Communication Library, supported by SII and Infrawaves.

C++ 142 7 Updated Apr 13, 2026

llvm / llvm-project

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 37,912 16,883 Updated Apr 17, 2026

NVIDIA / nsight-vscode-edition

A Visual Studio Code extension for building and debugging CUDA applications.

TypeScript 103 17 Updated Feb 14, 2026

uccl-project / uccl

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,305 137 Updated Apr 17, 2026