FreeLancer
Beijing, China
Starred repositories
Fast Hadamard transform in CUDA, with a PyTorch interface
20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
[CVPR 2024 Highlight] DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Universal LLM Deployment Engine with ML Compilation
[ICLR 2024 Spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
A collection of AWESOME things about mixture-of-experts
[NeurIPS 2023] LLM-Pruner: On the Structural Pruning of Large Language Models. Support Llama-3/3.1, Llama-2, LLaMA, BLOOM, Vicuna, Baichuan, TinyLlama, etc.
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
Accessible large language models via k-bit quantization for PyTorch (see the 4-bit loading sketch after this list).
Running large language models on a single GPU for throughput-oriented scenarios.
Hackable and optimized Transformers building blocks, supporting a composable construction.
Experiments on speculative sampling with Llama models
Large Language Model Text Generation Inference
A library for accelerating Transformer models on NVIDIA GPUs, including the use of 8-bit and 4-bit floating point (FP8 and FP4) precision on Hopper, Ada and Blackwell GPUs, to provide better performance…
Fast and memory-efficient exact attention (see the usage sketch after this list).
🎙️🤖 Create, Customize and Talk to your AI Character/Companion in Realtime (All in One Codebase!). Have a natural, seamless conversation with AI everywhere (mobile, web and terminal) using LLM OpenAI …
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
A framework for few-shot evaluation of language models.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Instant neural graphics primitives: lightning fast NeRF and more
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
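The SmoothQuant entry above is easy to illustrate in a few lines. Below is a minimal sketch of its per-channel smoothing step, not the paper's implementation: the helper name, the toy shapes, and alpha = 0.5 are illustrative assumptions, while the scaling rule s_j = max|X_j|^alpha / max|W_j|^(1-alpha) is the one the paper describes.

```python
# Minimal sketch of SmoothQuant's smoothing step: migrate activation outliers
# into the weights with a per-input-channel scale s, so Y = (X / s) @ (s * W)^T
# is numerically identical to X @ W^T but both factors are easier to quantize.
import torch

def smooth(x_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """x_absmax: per-input-channel max |activation| from calibration data, shape (in_features,).
    weight: linear layer weight, shape (out_features, in_features)."""
    w_absmax = weight.abs().amax(dim=0)                  # per-input-channel weight range
    s = (x_absmax ** alpha) / (w_absmax ** (1 - alpha))  # smoothing factors s_j
    s = s.clamp(min=1e-5)
    return s, weight * s                                 # fold s into the weights

# Toy check that smoothing leaves the linear layer's output unchanged:
x = torch.randn(32, 8) * torch.tensor([1., 1., 50., 1., 1., 1., 1., 1.])  # channel 2 has outliers
w = torch.randn(4, 8)
s, w_s = smooth(x.abs().amax(dim=0), w)
print((x @ w.T - (x / s) @ w_s.T).abs().max())  # ~0 up to float32 rounding
```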
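For the bitsandbytes entry ("accessible large language models via k-bit quantization for PyTorch"), here is a minimal sketch of 4-bit loading through its Hugging Face transformers integration; the model id is a placeholder and the prompt is arbitrary.

```python
# Minimal sketch: load a causal LM with 4-bit NF4 weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize linear-layer weights to 4 bits at load time
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization matters because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```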
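And for the flash-attn entry ("fast and memory-efficient exact attention"), a minimal sketch of calling the fused kernel directly; it assumes the flash_attn package is installed, a CUDA GPU is available, and the tensor sizes are arbitrary.

```python
# Minimal sketch: flash-attn expects fp16/bf16 CUDA tensors shaped
# (batch, seqlen, num_heads, head_dim).
import torch
from flash_attn import flash_attn_func

batch, seqlen, heads, head_dim = 2, 1024, 16, 64
q = torch.randn(batch, seqlen, heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # fused causal attention, no N x N matrix materialized
print(out.shape)  # torch.Size([2, 1024, 16, 64])
```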