View HandH1998's full-sized avatar


Accelerating MoE with IO and Tile-aware Optimizations

Python 365 17 Updated Dec 18, 2025
Python 1,591 117 Updated Dec 18, 2025

SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention

Python 176 8 Updated Dec 16, 2025

TurboDiffusion: 100–200× Acceleration for Video Diffusion Models

Python 762 24 Updated Dec 20, 2025

Tile-Based Runtime for Ultra-Low-Latency LLM Inference

Python 451 19 Updated Dec 8, 2025

A size profiler for CUDA binaries

Python 69 Updated Oct 7, 2025

NVIDIA cuTile learn

Python 130 Updated Dec 9, 2025

cuTile is a programming model for writing parallel kernels for NVIDIA GPUs

Python 1,632 83 Updated Dec 20, 2025

Helpful kernel tutorials and examples for tile-based GPU programming

Python 456 23 Updated Dec 19, 2025

Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling”

Python 82 1 Updated Dec 20, 2025

[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Python 114 8 Updated Dec 5, 2025

Distributed MoE in a Single Kernel [NeurIPS '25]

Cuda 157 18 Updated Dec 18, 2025
Python 612 57 Updated Dec 19, 2025

Advanced quantization toolkit for LLMs and VLMs. Support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Schemes and seamless integration with Transformers, vLLM, SGLang, and llm-compressor

Python 772 64 Updated Dec 19, 2025
C++ 209 6 Updated Nov 19, 2025
Python 88 12 Updated Nov 16, 2025

Low overhead tracing library and trace visualizer for pipelined CUDA kernels

C 127 5 Updated Nov 26, 2025

GPU programming related news and material links

1,874 110 Updated Sep 17, 2025

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python 468 44 Updated Nov 27, 2025

Triton-based Symmetric Memory operators and examples

Python 67 11 Updated Oct 17, 2025

A framework to compare low-bit integer and floating-point formats

Python 50 5 Updated Nov 1, 2025

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Python 774 51 Updated Oct 15, 2025

Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.

Python 78 1 Updated Dec 13, 2025

NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process communication and coordination overheads by allowing programmer…

C++ 418 48 Updated Dec 20, 2025

QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning

C++ 148 13 Updated Nov 11, 2025