DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference

Python 600 36 Updated Nov 24, 2025

A library of GPU kernels for sparse matrix operations.

C++ 283 53 Updated Nov 24, 2020

SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs

Cuda 61 16 Updated Mar 25, 2025

Helpful kernel tutorials and examples for tile-based GPU programming

Python 625 43 Updated Feb 2, 2026

⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)

Python 3,302 284 Updated Nov 26, 2025

Accelerating MoE with IO and Tile-aware Optimizations

Python 567 51 Updated Jan 19, 2026

[ASPLOS'26] Taming the Long-Tail: Efficient Reasoning RL Training with Adaptive Drafter

Python 131 12 Updated Dec 5, 2025
C++ 221 7 Updated Nov 19, 2025

The official repository for PTQTP implementation

10 Updated Sep 24, 2025

Trainable fast and memory-efficient sparse attention

Python 525 49 Updated Feb 1, 2026

🚀🚀 Efficient implementations of Native Sparse Attention

Python 1,042 12 Updated Sep 29, 2025
Python 164 12 Updated Jul 22, 2024

PyTorch bindings for CUTLASS grouped GEMM.

Cuda 184 48 Updated Dec 16, 2025

A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of vLLM).

Python 313 34 Updated Jun 10, 2025

Official PyTorch Implementation of "Diffusion Transformers with Representation Autoencoders"

Python 1,743 64 Updated Jan 20, 2026

QeRL enables RL for 32B LLMs on a single H100 GPU.

Python 480 48 Updated Nov 27, 2025

StreamingVLM: Real-Time Understanding for Infinite Video Streams

Python 865 57 Updated Oct 15, 2025

A framework for serving and evaluating LLM routers - save LLM costs without compromising quality

Python 4,575 360 Updated Aug 10, 2024

Modeling, training, eval, and inference code for OLMo

Python 6,301 701 Updated Nov 24, 2025

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Python 39 3 Updated Oct 16, 2025
Jupyter Notebook 118 11 Updated Jan 8, 2026

[CVPR 2025 Oral] Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Python 1,385 51 Updated Dec 16, 2025

Quantized LLM training in pure CUDA/C++.

C++ 237 14 Updated Jan 20, 2026

A lightweight Inference Engine built for block diffusion models

Python 40 5 Updated Dec 9, 2025

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Python 193 6 Updated Jan 23, 2026

Implementation of FP8/INT8 rollout for RL training without performance drop.

Python 288 19 Updated Nov 7, 2025

Paper reading and discussion notes, covering AI frameworks, distributed systems, cluster management, etc.

53 1 Updated Nov 11, 2025

Research prototype of PRISM, a cost-efficient multi-LLM serving system with flexible time- and space-based GPU sharing.

Python 57 2 Updated Aug 15, 2025
Python 12 6 Updated Jan 21, 2026