:shipit:
  • PaddlePaddle, Baidu
  • Beijing


Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

Python 4,452 340 Updated Jan 14, 2026
Python 359 45 Updated Apr 2, 2024

An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization

Python 175 17 Updated Nov 26, 2025

DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference

Python 638 37 Updated Nov 24, 2025

PyTorch native quantization and sparsity for training and inference

Python 2,856 527 Updated Jun 12, 2026
Python 3 Updated Jun 12, 2025

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

Python 7,082 788 Updated Jun 12, 2026

Disaggregated serving system for Large Language Models (LLMs).

Jupyter Notebook 820 94 Updated Apr 6, 2025

High performance Transformer implementation in C++.

C++ 152 18 Updated Jan 18, 2025
Cuda 34 1 Updated Apr 2, 2025

[HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache.

C++ 93 11 Updated May 14, 2026
Python 166 16 Updated Feb 15, 2025

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python 2,320 201 Updated Mar 27, 2024

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Python 17,274 1,313 Updated Jun 7, 2026

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

8,002 287 Updated May 15, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.

Python 1,085 88 Updated Sep 4, 2024

Development repository for the Triton language and compiler

MLIR 19,434 2,937 Updated Jun 13, 2026

[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, computes attention approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…

Python 1,221 78 Updated Apr 8, 2026

[ACL 2024] Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact"

Python 45 1 Updated May 24, 2024

[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.

Python 182 18 Updated Apr 24, 2026

Code for the NeurIPS'24 paper: QuaRot, end-to-end 4-bit inference of large language models.

Python 514 73 Updated Nov 26, 2024

This is the repo for the 6000D (Graph Processing and Analytics) final project at HKUST-GZ

Cuda 3 Updated Dec 14, 2023

Unified KV Cache Compression Methods for Auto-Regressive Models

Python 1,341 172 Updated Jan 4, 2025
Python 595 51 Updated Oct 29, 2024

Fast Hadamard transform in CUDA, with a PyTorch interface

C 327 63 Updated Mar 10, 2026

Examples of CUDA implementations by Cutlass CuTe

Makefile 278 34 Updated Jul 1, 2025

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)

Python 17,369 3,434 Updated Jun 13, 2026

The source code and script of CCF-THPC-

Python 2 Updated Feb 23, 2026

An Easy-to-understand TensorOp Matmul Tutorial

C++ 441 55 Updated Mar 5, 2026