
From Minimal GEMM to Everything

CUDA · 86 stars · 4 forks · Updated Nov 11, 2025

AISystem: AI systems, covering the full low-level AI stack, including AI chips, AI compilers, and AI inference and training frameworks

Jupyter Notebook · 15,895 stars · 2,275 forks · Updated Sep 3, 2025

fmchisel: Efficient Compression and Training Algorithms for Foundation Models

Python · 81 stars · 9 forks · Updated Oct 23, 2025

NVIDIA Inference Xfer Library (NIXL)

C++ · 774 stars · 209 forks · Updated Dec 21, 2025

TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…

Python · 12,442 stars · 1,970 forks · Updated Dec 22, 2025

VeOmni: Scaling Any Modality Model Training with Model-Centric Distributed Recipe Zoo

Python · 1,441 stars · 121 forks · Updated Dec 20, 2025

A modern GUI client based on Tauri, designed to run on Windows, macOS, and Linux for a tailored proxy experience

TypeScript · 88,265 stars · 6,496 forks · Updated Dec 22, 2025

[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training

Python · 255 stars · 23 forks · Updated Aug 9, 2025
Python · 26 stars · 2 forks · Updated Jan 7, 2025

A PyTorch native platform for training generative AI models

Python · 4,861 stars · 647 forks · Updated Dec 21, 2025

Using PyTorch autograd to compute the Hessian of perplexity for large language models

Python · 26 stars · 2 forks · Updated Apr 17, 2025

LLM model quantization (compression) toolkit with hardware-acceleration support for NVIDIA CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPUs via HF, vLLM, and SGLang.

Python · 939 stars · 141 forks · Updated Dec 22, 2025

[ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention

Python · 49 stars · 4 forks · Updated Aug 6, 2025

Official inference framework for 1-bit LLMs

Python · 24,461 stars · 1,914 forks · Updated Jun 3, 2025

Playing around with "Less Slow" coding practices in C++20, C, CUDA, PTX, and Assembly, from numerics and SIMD to coroutines, ranges, exception handling, networking, and user-space I/O

C++ · 1,885 stars · 81 forks · Updated Sep 10, 2025

Calculating the actual value of your job beyond just salary

TypeScript · 2,974 stars · 186 forks · Updated Dec 8, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ · 950 stars · 130 forks · Updated Dec 22, 2025

My learning notes for machine-learning systems (ML SYS).

Python · 4,738 stars · 300 forks · Updated Dec 22, 2025

Codebase for the Progressive Mixed-Precision Decoding paper.

Python · 19 stars · Updated Jul 15, 2025
C++ · 518 stars · 43 forks · Updated Nov 17, 2025

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Python · 121 stars · 7 forks · Updated Jul 4, 2025

[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs

Python · 2 stars · Updated Nov 20, 2024

Lightning fast C++/CUDA neural network framework

C++ · 4,359 stars · 536 forks · Updated Dec 14, 2025

Efficient Triton Kernels for LLM Training

Python · 5,964 stars · 452 forks · Updated Dec 21, 2025

FlashInfer: Kernel Library for LLM Serving

CUDA · 4,319 stars · 611 forks · Updated Dec 22, 2025
C++ · 97 stars · 9 forks · Updated Mar 26, 2025

Dynamic Memory Management for Serving LLMs without PagedAttention

C · 449 stars · 34 forks · Updated May 30, 2025

Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]

HTML · 40 stars · 5 forks · Updated May 13, 2025

Medusa: Accelerating Serverless LLM Inference with Materialization [ASPLOS'25]

HTML · 12 stars · 1 fork · Updated Nov 8, 2024

A highly optimized LLM inference acceleration engine for Llama and its variants.

C++ · 906 stars · 102 forks · Updated Jul 10, 2025