- PaddlePaddle, Baidu
- Beijing
Stars
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference
PyTorch native quantization and sparsity for training and inference
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Disaggregated serving system for Large Language Models (LLMs).
High performance Transformer implementation in C++.
[HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Development repository for the Triton language and compiler
[NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, computes attention with approximate and dynamic sparsity, which reduces inference latency by up to 10x for pre-filli…
[ACL 2024] Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact"
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.
Code for the NeurIPS 2024 paper "QuaRot", an end-to-end 4-bit inference scheme for large language models.
Repository for the final project of 6000D (Graph Processing and Analytics) at HKUST-GZ.
Unified KV Cache Compression Methods for Auto-Regressive Models
Fast Hadamard transform in CUDA, with a PyTorch interface
Examples of CUDA implementations by Cutlass CuTe
A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
An Easy-to-understand TensorOp Matmul Tutorial