View wangfakang's full-sized avatar

Organizations

@envoyproxy


Enabling PyTorch on XLA Devices (e.g. Google TPU)

C++ 2,771 570 Updated Dec 18, 2025

Python 8 Updated Apr 3, 2026

A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future …

TypeScript 61,281 5,078 Updated Apr 17, 2026

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

85 8 Updated Mar 14, 2026

Helpful kernel tutorials and examples for tile-based GPU programming

Python 702 65 Updated Apr 17, 2026

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 95 7 Updated Jan 16, 2026

How to optimize some algorithms in CUDA.

Cuda 2,928 270 Updated Apr 16, 2026

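AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn; the first AI lesson for everyone, the first systematic open-source Vibe Coding tutorial | from zero background to hands-on full-stack practice, so that everyone can use AI to realize their own ideas and…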

Dockerfile 4,603 374 Updated Mar 18, 2026

Here are my personal paper reading notes (including machine learning systems, AI infrastructure, and other interesting topics).

JavaScript 180 10 Updated Apr 12, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,408 138 Updated Apr 10, 2026

An open-source AI agent that lives in your terminal.

TypeScript 23,450 2,223 Updated Apr 17, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 635 73 Updated Apr 15, 2026

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

C 136 6 Updated Nov 26, 2025

High-performance distributed data shuffling (all-to-all) library for MoE training and inference

Python 118 11 Updated Mar 7, 2026

Nex Venus Communication Library

C++ 74 7 Updated Nov 17, 2025

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 2,023 131 Updated Apr 17, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,595 1,796 Updated Apr 17, 2026

SGLang kernel library for NPU

C++ 125 111 Updated Apr 16, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

Python 5,497 508 Updated Apr 17, 2026

Extensible collectives library in Triton

Python 98 6 Updated Mar 31, 2025

Venus Collective Communication Library, supported by SII and Infrawaves.

C++ 142 7 Updated Apr 13, 2026

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 37,912 16,883 Updated Apr 17, 2026

A Visual Studio Code extension for building and debugging CUDA applications.

TypeScript 103 17 Updated Feb 14, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,305 137 Updated Apr 17, 2026

CUDA Python: Performance meets Productivity

Cython 3,220 273 Updated Apr 17, 2026

FlagCX is a scalable and adaptive cross-chip communication library.

C++ 184 53 Updated Apr 17, 2026

Python 158 24 Updated Oct 9, 2024

Pipeline Parallelism Emulation and Visualization

Python 81 9 Updated Jan 8, 2026

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Cuda 1,089 173 Updated Apr 17, 2026

Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport

Cuda 74 3 Updated May 9, 2025