View wangfakang's full-sized avatar

Organizations

@envoyproxy


cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,643 83 Updated Dec 20, 2025

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 8,995 1,588 Updated Dec 21, 2025

SGLang kernel library for NPU

C++ 86 61 Updated Dec 20, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 4,268 350 Updated Dec 21, 2025

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 75 6 Updated Dec 19, 2025

Extensible collectives library in Triton

Python 91 6 Updated Mar 31, 2025

Venus Collective Communication Library, supported by SII and Infrawaves.

C++ 125 5 Updated Dec 18, 2025

The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.

LLVM 36,029 15,553 Updated Dec 21, 2025

A Visual Studio Code extension for building and debugging CUDA applications.

TypeScript 94 17 Updated Dec 4, 2025

UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)

C++ 1,132 105 Updated Dec 21, 2025

CUDA Python: Performance meets Productivity

Cython 3,100 234 Updated Dec 19, 2025
C++ 136 43 Updated Dec 19, 2025
Python 144 19 Updated Oct 9, 2024

Pipeline Parallelism Emulation and Visualization

Python 74 7 Updated Jun 12, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 950 130 Updated Dec 21, 2025

Aims to implement dual-port and multi-qp solutions in deepEP ibrc transport

Cuda 72 3 Updated May 9, 2025

Perplexity GPU Kernels

C++ 540 74 Updated Nov 7, 2025

Official implementation of the paper "SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training"

Python 42 8 Updated Dec 11, 2024

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,948 288 Updated May 15, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,820 1,034 Updated Dec 5, 2025

Morpheus SDK

Python 555 195 Updated Sep 29, 2025

This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".

Python 1,015 105 Updated Sep 29, 2022

Fast OS-level support for GPU checkpoint and restore

C++ 263 28 Updated Sep 28, 2025
HTML 227 47 Updated Dec 5, 2025

CUDA checkpoint and restore utility

C 398 25 Updated Sep 15, 2025

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python 12,438 1,970 Updated Dec 21, 2025

prime is a framework for efficient, globally distributed training of AI models over the internet.

Python 848 94 Updated Nov 16, 2025
42 5 Updated Nov 5, 2024

tee-like program that tee-s stdin to a rotated log file(s) and can compress them.

C++ 15 5 Updated Jan 28, 2018

NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…

Python 239 39 Updated Dec 21, 2025