AntGroup - HangZhou - http://wangfakang.github.io
Stars
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.
Venus Collective Communication Library, supported by SII and Infrawaves.
The LLVM Project is a collection of modular and reusable compiler and toolchain technologies.
A Visual Studio Code extension for building and debugging CUDA applications.
UCCL is an efficient communication library for GPUs, covering collectives, P2P (e.g., KV cache transfer, RL weight transfer), and EP (e.g., GPU-driven)
CUDA Python: Performance meets Productivity
Pipeline Parallelism Emulation and Visualization
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
Official implementation of the paper "SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training"
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
DeepEP: an efficient expert-parallel communication library
This is an official implementation for "SimMIM: A Simple Framework for Masked Image Modeling".
Fast OS-level support for GPU checkpoint and restore
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor… (a minimal sketch of this Python API appears after this list)
prime is a framework for efficient, globally distributed training of AI models over the internet.
A tee-like program that tees stdin to one or more rotated log files and can compress them.
NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the effective training time by minimizing the downtime due to fa…
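
For context on the TensorRT LLM entry above, here is a minimal sketch of its high-level Python LLM API, loosely following the project's quickstart; the model name and sampling settings are illustrative placeholders, not a definitive recipe.

    from tensorrt_llm import LLM, SamplingParams

    def main():
        # Load a Hugging Face checkpoint and build an engine for it
        # (placeholder model id; any supported model works here).
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

        prompts = ["Hello, my name is", "The capital of France is"]
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

        # Batched generation on the GPU; each result carries the prompt
        # and the generated completions.
        for output in llm.generate(prompts, sampling_params):
            print(f"{output.prompt!r} -> {output.outputs[0].text!r}")

    if __name__ == "__main__":
        main()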