Stars
xiaolong-li1 / VIDEO-BLADE
Forked from ziplab/BLADE
This is the official PyTorch implementation of "Video-BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation."
This is the official PyTorch implementation of "BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation."
A project implementing various agentic RL methods on top of the Slime post-training framework
TriAttention — Efficient long reasoning with trigonometric KV cache compression. Enables OpenClaw local deployment on memory-constrained GPUs.
[CVPRW 2026 Oral] Less Detail, Better Answers: Degradation-Driven Prompting for VQA
A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-performance systems.
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.
🔥 LeetCode for PyTorch — practice implementing softmax, attention, GPT-2 and more from scratch with instant auto-grading. Jupyter-based, self-hosted or try online.
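For context, the kind of from-scratch exercise this repo targets can be sketched in a few lines of NumPy; this is an illustrative, numerically stable softmax, not the repo's own grading code:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row max first: exp() of large logits would overflow,
    # and shifting by a constant does not change the softmax result.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

logits = np.array([[1.0, 2.0, 3.0]])
probs = softmax(logits)  # rows sum to 1
```

The max-subtraction trick is the standard stability fix and is exactly the kind of detail such auto-graded exercises check for.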
LM engine is a library for pretraining/finetuning LLMs
A Model Context Protocol (MCP) server for creating, reading, and manipulating Microsoft Word documents. This server enables AI assistants to work with Word documents through a standardized interface…
Official repository of paper [FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic, NeurIPS 2025]
Artifact for PPoPP'26 "RoMeo: Mitigating Dual-dimensional Outliers with Rotated Mixed Precision Quantization"
flash attention tutorial written in python, triton, cuda, cutlass
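As an illustration of the tiled online-softmax idea that flash attention is built on, here is a minimal NumPy sketch; it is not taken from the tutorial, and the block size and function names are my own:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Reference implementation: materializes the full (n, n) score matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def flash_attention(Q, K, V, block=4):
    # Processes K/V in tiles, keeping only a running max and running
    # softmax denominator per query row (the "online softmax" trick),
    # so the full score matrix is never stored.
    n, d = Q.shape
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row max
    l = np.zeros(n)           # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, K.shape[0], block):
        Kb, Vb = K[j:j + block], V[j:j + block]
        s = Q @ Kb.T * scale                     # score tile, shape (n, block)
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        correction = np.exp(m - m_new)           # rescale old partial sums
        l = l * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]
```

The real kernels in Triton/CUDA/CUTLASS additionally fuse this loop into shared memory and registers, but the rescaling arithmetic is the same.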
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
A library of GPU kernels for sparse matrix operations.
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
Helpful kernel tutorials and examples for tile-based GPU programming
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
Accelerating MoE with IO and Tile-aware Optimizations
[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter
Trainable fast and memory-efficient sparse attention
🚀🚀 Efficient implementations of Native Sparse Attention
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
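For readers unfamiliar with the operation: a grouped GEMM is one independent matmul per group, where shapes may differ across groups (as with MoE expert layers receiving different token counts), so a single batched GEMM cannot express it. A plain NumPy sketch of the semantics, not the CUTLASS API:

```python
import numpy as np

def grouped_gemm(As, Bs):
    # One matmul per (A_i, B_i) pair. Each group may have a different
    # M dimension, which is why this cannot be a single batched GEMM;
    # CUTLASS fuses all groups into one kernel launch instead of a loop.
    return [A @ B for A, B in zip(As, Bs)]

# Hypothetical MoE-style usage: two experts, different token counts.
tokens_e0 = np.ones((5, 3))   # 5 tokens routed to expert 0
tokens_e1 = np.ones((2, 3))   # 2 tokens routed to expert 1
w_e0 = np.ones((3, 4))
w_e1 = np.ones((3, 4))
outs = grouped_gemm([tokens_e0, tokens_e1], [w_e0, w_e1])
```

The performance win of a fused kernel comes from avoiding one launch per expert when group sizes are small and uneven.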