Beijing · 18:16 (UTC +08:00) · https://scholar.google.com/citations?hl=zh-CN&user=MBR97ZIAAAAJ
Stars
Accelerating MoE with IO and Tile-aware Optimizations
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Helpful kernel tutorials and examples for tile-based GPU programming
Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling”
[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Distributed MoE in a Single Kernel [NeurIPS '25]
Advanced quantization toolkit for LLMs and VLMs. Supports WOQ, MXFP4, NVFP4, GGUF, and adaptive schemes, with seamless integration into Transformers, vLLM, SGLang, and llm-compressor
Low overhead tracing library and trace visualizer for pipelined CUDA kernels
GPU programming related news and material links
Triton-based Symmetric Memory operators and examples
A framework to compare low-bit integer and floating-point formats (a minimal illustration follows this list)
StreamingVLM: Real-Time Understanding for Infinite Video Streams
Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
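The low-bit format comparison above can be illustrated with a short, self-contained sketch. This is not the framework's actual API; it simply simulates symmetric INT4 quantization and an E2M1-style FP4 grid with per-tensor scaling (both assumptions for illustration) and compares their reconstruction error on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

# INT4: symmetric per-tensor scaling onto the integer grid [-7, 7].
def quant_int4(v):
    scale = np.abs(v).max() / 7.0
    q = np.clip(np.round(v / scale), -7, 7)
    return q * scale

# FP4 (E2M1-style): snap each magnitude to the nearest representable value
# after scaling so the largest input maps to the largest code (6.0).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
def quant_fp4(v):
    scale = np.abs(v).max() / 6.0
    mag = np.abs(v) / scale
    idx = np.argmin(np.abs(mag[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(v) * FP4_GRID[idx] * scale

# Compare mean-squared reconstruction error of the two 4-bit formats.
for name, fn in [("int4", quant_int4), ("fp4-e2m1", quant_fp4)]:
    mse = np.mean((x - fn(x)) ** 2)
    print(f"{name}: mse={mse:.6f}")
```

Real comparison frameworks additionally sweep block sizes, scale formats, and rounding modes; this sketch only shows the core idea of mapping values onto each format's representable grid and measuring the resulting error.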