Haofeng Huang jason-huang03

Undergraduate student from IIIS (Yao Class), Tsinghua University | Training Framework | Kernel | Model Arch & GenAI

182 followers · 34 following

Tsinghua University
Beijing, China

Stars

modal-labs / gpu-glossary

GPU documentation for humans

Python 399 48 Updated Nov 10, 2025

xlite-dev / ffpa-attn

🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.

Cuda 228 10 Updated Aug 8, 2025

meta-pytorch / torchcomms

torchcomms: a modern PyTorch communications API

C++ 276 31 Updated Nov 13, 2025

XunhaoLai / native-sparse-attention-triton

Efficient Triton implementation of Native Sparse Attention.

Python 247 18 Updated May 23, 2025

Relaxed-System-Lab / Flash-Sparse-Attention

🚀🚀 Efficient implementations of Native Sparse Attention

Python 1,020 8 Updated Sep 29, 2025

ModelTC / LightCompress

[EMNLP 2024 & AAAI 2026] A powerful toolkit for compressing large models including LLM, VLM, and video generation models.

Python 619 63 Updated Nov 10, 2025

fal-ai / flashpack

High-throughput tensor loading for PyTorch

Python 195 12 Updated Oct 27, 2025

triton-lang / triton

Development repository for the Triton language and compiler

MLIR 17,538 2,385 Updated Nov 13, 2025

MoonshotAI / Kimi-Linear

1,157 50 Updated Oct 31, 2025

frederic-santos / taoanalysissolutions

Propositions of solutions to the exercises from Terence Tao's textbooks, Analysis I & II. Mirrored from https://gitlab.com/f-santos/taoanalysissolutions

TeX 97 11 Updated Jan 17, 2023

NVIDIA / compute-eval

Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Large Language Models.

Python 74 14 Updated Oct 1, 2025

KuangjuX / NVSHMEM-Tutorial

NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer

Cuda 142 11 Updated Sep 18, 2025

microsoft / mscclpp

MSCCL++: A GPU-driven communication stack for scalable AI applications

C++ 434 72 Updated Nov 13, 2025

THUDM / slime

slime is an LLM post-training framework for RL Scaling.

Python 2,461 251 Updated Nov 13, 2025

yifan123 / flow_grpo

[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL

Python 1,575 93 Updated Nov 4, 2025

NVIDIA / tilus

Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.

Python 398 9 Updated Nov 8, 2025

ModelTC / LightX2V

Light Video Generation Inference Framework

Python 792 50 Updated Nov 13, 2025

ModelTC / Qwen-Image-Lightning

Qwen-Image-Lightning: Speed up Qwen-Image model with distillation

Python 943 38 Updated Oct 14, 2025

NVIDIA / nvbench

CUDA Kernel Benchmarking Library

Cuda 762 90 Updated Oct 21, 2025

ByteDance-Seed / cudaLLM

Python 120 6 Updated Aug 18, 2025

qingkelab / qingketalk

Qingke Talk (青稞Talk)

160 1 Updated Nov 13, 2025

KEKE046 / mlir-tutorial

Hands-On Practical MLIR Tutorial

C++ 651 95 Updated Oct 20, 2023

antgroup / DeepXTrace

DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.

Python 66 4 Updated Nov 5, 2025

stepfun-ai / StepMesh

C++ 316 29 Updated Nov 13, 2025

simveit / cute_persistent_kernels

Python 9 Updated Jul 25, 2025

Lightricks / LTX-Video-Q8-Kernels

Python 71 14 Updated May 14, 2025

lxa9867 / Awesome-Autoregressive-Visual-Generation

This is a repo to track the latest autoregressive visual generation papers.

409 5 Updated Jun 25, 2025

nelvko / clash-for-linux-install

😼 Use a clash/mihomo-based proxy environment elegantly

Shell 5,801 738 Updated Nov 11, 2025

vosen / ZLUDA

CUDA on non-NVIDIA GPUs

Rust 13,453 849 Updated Nov 13, 2025

star-history / star-history

The missing star history graph of GitHub repos - https://star-history.com

TypeScript 8,056 302 Updated Nov 12, 2025