AntGroup - Hangzhou (UTC +08:00) - http://wangfakang.github.io
Stars
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
Helpful kernel tutorials and examples for tile-based GPU programming
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
How to optimize algorithms in CUDA.
AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn
Personal paper-reading notes covering machine learning systems, AI infrastructure, and other interesting topics.
Distributed Compiler based on Triton for Parallel Systems
An open-source AI agent that lives in your terminal.
Accelerating MoE with IO and Tile-aware Optimizations
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
High-performance distributed data shuffling (all-to-all) library for MoE training and inference (a minimal dispatch sketch appears after this list)
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Venus Collective Communication Library, supported by SII and Infrawaves.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
A Visual Studio Code extension for building and debugging CUDA applications.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
CUDA Python: Performance meets Productivity
FlagCX is a scalable and adaptive cross-chip communication library.
Pipeline Parallelism Emulation and Visualization (a bubble-fraction sketch appears after this list)
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
Official implementation of the paper "SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
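
The all-to-all (dispatch) pattern behind the MoE data-shuffling entry above can be illustrated without a GPU or a real communication library. Below is a minimal single-process sketch in plain Python; the rank layout, function names, and routing rule are illustrative assumptions, not any library's API. Each rank buckets its tokens by the rank that owns the routed expert, and the all-to-all exchange leaves every rank holding exactly the tokens its local experts must process.

    # Minimal single-process simulation of MoE all-to-all token dispatch.
    # All names and the routing rule here are illustrative assumptions.

    def all_to_all(buckets):
        """buckets[src][dst] -> payload; returns received[dst][src]."""
        n = len(buckets)
        return [[buckets[src][dst] for src in range(n)] for dst in range(n)]

    def dispatch(tokens_per_rank, expert_of, experts_per_rank):
        """Bucket each rank's tokens by the rank owning the routed expert,
        then exchange buckets so every rank holds its experts' tokens."""
        n = len(tokens_per_rank)
        buckets = [[[] for _ in range(n)] for _ in range(n)]
        for src in range(n):
            for tok in tokens_per_rank[src]:
                dst = expert_of(tok) // experts_per_rank  # owning rank
                buckets[src][dst].append(tok)
        return all_to_all(buckets)

    if __name__ == "__main__":
        # 2 ranks, 2 experts per rank; route token t to expert t % 4.
        received = dispatch([[0, 1, 2, 3], [4, 5, 6, 7]],
                            lambda t: t % 4, experts_per_rank=2)
        print(received[0])  # [[0, 1], [4, 5]] -> tokens for experts 0-1
        print(received[1])  # [[2, 3], [6, 7]] -> tokens for experts 2-3

A real implementation exchanges bucket sizes first (so receive buffers can be sized) and runs a symmetric combine step after the experts compute; this sketch shows only the routing logic.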
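
The key quantity a pipeline-parallelism emulator visualizes can also be stated in a few lines. Assuming the standard GPipe-style schedule with p stages, m microbatches, and uniform per-stage compute (an idealization), the idle "bubble" fraction is (p - 1) / (m + p - 1). The sketch below simply evaluates that formula; it is not the tool's code.

    def gpipe_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
        """Idle-time fraction of an idealized GPipe schedule.
        Each stage occupies num_microbatches + num_stages - 1 time slots,
        of which num_stages - 1 are pipeline fill/drain bubble."""
        p, m = num_stages, num_microbatches
        return (p - 1) / (m + p - 1)

    if __name__ == "__main__":
        for m in (4, 8, 32):
            frac = gpipe_bubble_fraction(4, m)
            print(f"4 stages, {m:2d} microbatches -> bubble {frac:.1%}")
        # Prints 42.9%, 27.3%, 8.6%: more microbatches shrink the bubble.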