Organizations

@envoyproxy

Python 8 Updated Apr 3, 2026

A Claude Code plugin that automatically captures everything Claude does during your coding sessions, compresses it with AI (using Claude's agent-sdk), and injects relevant context back into future …

TypeScript 46,682 3,593 Updated Apr 9, 2026

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

79 7 Updated Mar 14, 2026

Helpful kernel tutorials and examples for tile-based GPU programming

Python 695 59 Updated Apr 9, 2026

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 95 7 Updated Jan 16, 2026

How to optimize some algorithms in CUDA.

Cuda 2,910 267 Updated Apr 9, 2026

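AI for All: The First Systematic Vibe Coding Tutorial | From Zero to Full-Stack, Bring Your Ideas to Life | Live at: www.vibevibe.cn; The first AI lesson for everyone: the first systematic open-source Vibe Coding tutorial | From zero experience to full-stack practice, so that everyone can use AI to realize their own ideas and…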

Dockerfile 4,458 363 Updated Mar 18, 2026

Here are my personal paper reading notes (including machine learning systems, AI infrastructure, and other interesting topics).

JavaScript 175 10 Updated Apr 9, 2026

Distributed Compiler based on Triton for Parallel Systems

Python 1,403 136 Updated Mar 11, 2026

An open-source AI agent that lives in your terminal.

TypeScript 22,239 2,058 Updated Apr 9, 2026

Accelerating MoE with IO and Tile-aware Optimizations

Python 625 68 Updated Apr 1, 2026

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

C 136 6 Updated Nov 26, 2025

High-performance distributed data shuffling (all-to-all) library for MoE training and inference

Python 117 11 Updated Mar 7, 2026

Nex Venus Communication Library

C++ 73 7 Updated Nov 17, 2025

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 2,013 130 Updated Apr 4, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,552 1,779 Updated Apr 9, 2026

SGLang kernel library for NPU

C++ 114 105 Updated Apr 9, 2026

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 5,475 499 Updated Apr 8, 2026

Extensible collectives library in Triton

Python 98 6 Updated Mar 31, 2025

Venus Collective Communication Library, supported by SII and Infrawaves.

C++ 141 7 Updated Mar 4, 2026

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 37,785 16,812 Updated Apr 9, 2026

A Visual Studio Code extension for building and debugging CUDA applications.

TypeScript 104 17 Updated Feb 14, 2026

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,279 133 Updated Apr 9, 2026

CUDA Python: Performance meets Productivity

Cython 3,213 267 Updated Apr 9, 2026

FlagCX is a scalable and adaptive cross-chip communication library.

C++ 184 53 Updated Apr 9, 2026
Python 156 24 Updated Oct 9, 2024

Pipeline Parallelism Emulation and Visualization

Python 81 9 Updated Jan 8, 2026

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

Cuda 1,081 167 Updated Apr 9, 2026

Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport

Cuda 74 3 Updated May 9, 2025

Perplexity GPU Kernels

C++ 565 83 Updated Nov 7, 2025