
Organizations

@envoyproxy


IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

54 5 Updated Mar 14, 2026

Helpful kernel tutorials and examples for tile-based GPU programming

Python 684 55 Updated Mar 26, 2026

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 95 7 Updated Jan 16, 2026

How to optimize some algorithms in CUDA.

Cuda 2,888 265 Updated Mar 24, 2026

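AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn; The first AI lesson for everyone, the first systematic open-source Vibe Coding tutorial | From zero experience to full-stack practice, so that everyone can use AI to realize their own ideas and…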

Dockerfile 4,170 342 Updated Mar 18, 2026

My personal paper reading notes (covering machine learning systems, AI infrastructure, and other interesting topics).

171 10 Updated Jan 27, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,396 134 Updated Mar 11, 2026

An open-source AI agent that lives in your terminal.

TypeScript 21,099 1,886 Updated Mar 26, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 614 67 Updated Mar 24, 2026

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

C 135 6 Updated Nov 26, 2025

High-performance distributed data shuffling (all-to-all) library for MoE training and inference

Python 117 11 Updated Mar 7, 2026

Nex Venus Communication Library

C++ 74 7 Updated Nov 17, 2025

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,998 130 Updated Mar 26, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,496 1,750 Updated Mar 24, 2026

SGLang kernel library for NPU

C++ 109 98 Updated Mar 26, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 5,430 486 Updated Mar 26, 2026

Extensible collectives library in Triton

Python 98 6 Updated Mar 31, 2025

Venus Collective Communication Library, supported by SII and Infrawaves.

C++ 141 7 Updated Mar 4, 2026

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 37,604 16,678 Updated Mar 26, 2026

A Visual Studio Code extension for building and debugging CUDA applications.

TypeScript 102 17 Updated Feb 14, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,241 132 Updated Mar 26, 2026

CUDA Python: Performance meets Productivity

Cython 3,195 263 Updated Mar 26, 2026

FlagCX is a scalable and adaptive cross-chip communication library.

C++ 183 52 Updated Mar 26, 2026
Python 155 24 Updated Oct 9, 2024

Pipeline Parallelism Emulation and Visualization

Python 81 9 Updated Jan 8, 2026

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Cuda 1,075 163 Updated Mar 26, 2026

Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport

Cuda 74 3 Updated May 9, 2025

Perplexity GPU Kernels

C++ 565 81 Updated Nov 7, 2025

Official implementation of the paper "SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training"

Python 44 8 Updated Dec 11, 2024

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,972 288 Updated May 15, 2025