FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
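At small batch sizes, decoding is bound by weight memory traffic, so storing weights in INT4 instead of FP16 cuts the bytes read per matmul by ~4x, which is where the near-ideal speedup comes from until compute becomes the bottleneck. Below is a minimal NumPy sketch of the data layout such a kernel consumes; the group size, shapes, and function names are illustrative, and a real kernel fuses the dequantization into the GPU GEMM rather than materializing FP16 weights.

```python
# Sketch of weight-only INT4 quantization with per-group FP16 scales,
# followed by on-the-fly dequantization for an FP16 matmul.
import numpy as np

def quantize_int4(w, group_size=128):
    """Symmetric 4-bit quantization, one scale per group of input features."""
    out_f, in_f = w.shape
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.abs(w_g).max(axis=-1, keepdims=True) / 7.0  # INT4 range is [-8, 7]
    q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

def fp16_int4_matmul(x, q, scales):
    """Dequantize INT4 weights to FP16, then multiply with FP16 activations."""
    w = (q.astype(np.float16) * scales).reshape(q.shape[0], -1)
    return x.astype(np.float16) @ w.T

x = np.random.randn(16, 4096)    # a batch of 16 tokens
w = np.random.randn(4096, 4096)  # one linear layer's weights
q, s = quantize_int4(w)
y = fp16_int4_matmul(x, q, s)    # approximates x @ w.T
```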
QuantLLM is a Python library designed for developers, researchers, and teams who want to fine-tune and deploy large language models (LLMs) efficiently using 4-bit and 8-bit quantization techniques.
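QuantLLM's own API is not shown here; as a generic point of reference for the kind of workflow such libraries target, this is how 4-bit model loading looks with Hugging Face transformers plus bitsandbytes (a different, widely used stack), with the model id as a placeholder:

```python
# Generic 4-bit (NF4) model loading with transformers + bitsandbytes.
# NOT QuantLLM's API; purely illustrative of 4-bit deployment.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # normal-float 4-bit
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16
)
model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-model",                 # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```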
Supporting code for "LLMs for your iPhone: Whole-Tensor 4 Bit Quantization"
Mixed-precision (16/8/4-bit) quantization scheme for the Wan2.2-Animate-14B model. Compresses the original 35 GB base model to 17 GB, balancing inference performance against model size.
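The roughly 2x size reduction is consistent with halving the average bits per weight. A back-of-the-envelope check under an assumed precision split (the 20/40/40 mix below is purely illustrative; the repo's actual per-layer breakdown is not shown here):

```python
# Back-of-the-envelope size check with a hypothetical precision split.
base_gb = 35.0                              # 16-bit base checkpoint
avg_bits = 0.20 * 16 + 0.40 * 8 + 0.40 * 4  # = 8.0 bits/weight on average
print(base_gb * avg_bits / 16)              # ~17.5 GB, close to the reported 17 GB
```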
A 4-bit TTL computer.
ACE-Step 1.5 XL optimized fork: 4-bit (INT4) Windows support plus RTX 2080 Ti (Turing architecture) stability fixes. Runs XL SFT on 11 GB of VRAM without OOM.