- Tsinghua University
- Beijing, China
- https://jason-huang03.github.io/
Stars
Accelerating MoE with IO- and Tile-aware Optimizations
Boosting 4-bit inference kernels with 2:4 Sparsity
BitBLAS is a library for mixed-precision matrix multiplications, especially for quantized LLM deployment.
A Triton-only attention backend for vLLM
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Helpful kernel tutorials and examples for tile-based GPU programming
FFPA: extends FlashAttention-2 with Split-D for ~O(1) SRAM complexity at large head dimensions, 1.8x~3x faster than SDPA EA.
torchcomms: a modern PyTorch communications API
Efficient Triton implementation of Native Sparse Attention.
Efficient implementations of Native Sparse Attention
[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models, including LLMs, VLMs, and video generation models.
Development repository for the Triton language and compiler
Proposed solutions to the exercises from Terence Tao's textbooks Analysis I & II. Mirrored from https://gitlab.com/f-santos/taoanalysissolutions
ComputeEval: a framework for generating and evaluating CUDA code from Large Language Models.
NVSHMEM-Tutorial: Build a DeepEP-like GPU Buffer
MSCCL++: A GPU-driven communication stack for scalable AI applications
slime is an LLM post-training framework for RL Scaling.
[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.