  • Tsinghua University
  • Beijing, China

Organizations

@thu-nics @thu-ml


GPU documentation for humans

Python 399 48 Updated Nov 10, 2025

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda 228 10 Updated Aug 8, 2025

torchcomms: a modern PyTorch communications API

C++ 276 31 Updated Nov 13, 2025

Efficient Triton implementation of Native Sparse Attention.

Python 247 18 Updated May 23, 2025

🚀🚀 Efficient implementations of Native Sparse Attention

Python 1,020 8 Updated Sep 29, 2025

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLM, VLM, and video generation models.

Python 619 63 Updated Nov 10, 2025

High-throughput tensor loading for PyTorch

Python 195 12 Updated Oct 27, 2025

Development repository for the Triton language and compiler

MLIR 17,538 2,385 Updated Nov 13, 2025

Proposed solutions to the exercises in Terence Tao's textbooks Analysis I & II. Mirrored from https://gitlab.com/f-santos/taoanalysissolutions

TeX 97 11 Updated Jan 17, 2023

Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Large Language Models.

Python 74 14 Updated Oct 1, 2025

NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer

Cuda 142 11 Updated Sep 18, 2025

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 434 72 Updated Nov 13, 2025

slime is an LLM post-training framework for RL Scaling.

Python 2,461 251 Updated Nov 13, 2025

[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL

Python 1,575 93 Updated Nov 4, 2025

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

Python 398 9 Updated Nov 8, 2025

Light Video Generation Inference Framework

Python 792 50 Updated Nov 13, 2025

Qwen-Image-Lightning: Speed up Qwen-Image model with distillation

Python 943 38 Updated Oct 14, 2025

CUDA Kernel Benchmarking Library

Cuda 762 90 Updated Oct 21, 2025

Python 120 6 Updated Aug 18, 2025

青稞Talk

160 1 Updated Nov 13, 2025

Hands-On Practical MLIR Tutorial

C++ 651 95 Updated Oct 20, 2023

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 66 4 Updated Nov 5, 2025

C++ 316 29 Updated Nov 13, 2025

Python 9 Updated Jul 25, 2025

This is a repo to track the latest autoregressive visual generation papers.

409 5 Updated Jun 25, 2025

😼 Use clash/mihomo-based proxy environments gracefully

Shell 5,801 738 Updated Nov 11, 2025

CUDA on non-NVIDIA GPUs

Rust 13,453 849 Updated Nov 13, 2025

The missing star history graph of GitHub repos - https://star-history.com

TypeScript 8,056 302 Updated Nov 12, 2025