dalistarh
  • IST Austria & Neural Magic


Quantized LLM training in pure CUDA/C++.

C++ · 214 stars · 14 forks · Updated Nov 3, 2025

Efficient non-uniform quantization with GPTQ for GGUF

Python · 52 stars · 4 forks · Updated Sep 17, 2025
Python · 59 stars · 10 forks · Updated Nov 5, 2025

QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning

C++ · 125 stars · 9 forks · Updated Nov 4, 2025

Code for data-aware compression of DeepSeek models

Python · 57 stars · 10 forks · Updated Jun 10, 2025

An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs

Python · 555 stars · 51 forks · Updated Nov 2, 2025

Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train OpenAI gpt-oss, DeepSeek-R1, Qwen3, Gemma 3, TTS 2x faster with 70% less VRAM.

Python · 47,906 stars · 3,916 forks · Updated Nov 5, 2025

Work in progress.

Jupyter Notebook · 75 stars · 6 forks · Updated Jun 29, 2025

An Open Large Reasoning Model for Real-World Solutions

Python · 1,524 stars · 80 forks · Updated May 30, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 62,088 stars · 11,037 forks · Updated Nov 5, 2025

List of (mostly ML) papers where the description of the method could be shortened significantly

5 stars · Updated Nov 9, 2024

Fast Matrix Multiplications for Lookup Table-Quantized LLMs

C++ · 375 stars · 18 forks · Updated Apr 13, 2025
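The lookup-table quantization that fast-matmul kernels of this kind accelerate can be illustrated with a minimal NumPy sketch (a hypothetical toy example, not this repository's kernel): weights are stored as small integer codes, and a per-matrix codebook maps each code back to a float value at matmul time, so arbitrary non-uniform quantization grids are supported.

```python
import numpy as np

# 2-bit non-uniform quantization: 4 codebook entries, chosen freely
# (uniform grids are a special case; LUT kernels allow any values here).
codebook = np.array([-1.5, -0.5, 0.5, 1.5], dtype=np.float32)

# Weights are stored only as integer codes into the codebook.
codes = np.array([[0, 3, 1],
                  [2, 2, 0]], dtype=np.uint8)

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)

w = codebook[codes]   # dequantize via table lookup (fancy indexing)
y = w @ x             # then an ordinary matmul; fused in real kernels
```

Real kernels fuse the lookup and the multiply so the full-precision `w` is never materialized in memory; the sketch above separates the two steps for clarity.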

Code for the EMNLP 2024 paper "Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on LLMs".

Python · 8 stars · Updated Jun 18, 2024

Efficient Triton Kernels for LLM Training

Python · 5,800 stars · 426 forks · Updated Nov 5, 2025

Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM

Python · 2,194 stars · 276 forks · Updated Nov 4, 2025

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python · 4,355 stars · 324 forks · Updated Aug 16, 2025

Vector Approximate Message Passing inference framework for GWAS

C++ · 16 stars · 4 forks · Updated Jul 16, 2025
Python · 296 stars · 19 forks · Updated Apr 8, 2025

Official implementation of the ICML 2024 paper RoSA (Robust Adaptation)

Python · 44 stars · 6 forks · Updated Feb 13, 2024
Python · 564 stars · 49 forks · Updated Oct 29, 2024

Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models".

Python · 277 stars · 23 forks · Updated Nov 3, 2023

[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Python · 631 stars · 56 forks · Updated Mar 4, 2024

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Python · 4,984 stars · 525 forks · Updated Apr 11, 2025

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Python · 8,784 stars · 572 forks · Updated May 3, 2024

4 bits quantization of LLaMA using GPTQ

Python · 3,076 stars · 461 forks · Updated Jul 13, 2024
C++ · 15 stars · 5 forks · Updated Sep 27, 2023

A collection of libraries to optimise AI model performance

Python · 8,366 stars · 632 forks · Updated Jul 22, 2024

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python · 2,213 stars · 182 forks · Updated Mar 27, 2024
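GPTQ, like several other projects listed here, quantizes a frozen, already-trained weight matrix. As a rough point of reference for what post-training weight quantization means, the following is a plain round-to-nearest baseline (a simplification: GPTQ itself additionally uses second-order information to compensate rounding error, which this sketch does not attempt):

```python
import numpy as np

def quantize_rtn(w, bits=4):
    """Per-row symmetric round-to-nearest quantization of a weight matrix.

    A simplified baseline, not the GPTQ algorithm: the quantize/dequantize
    round trip looks the same, but there is no error compensation.
    """
    qmax = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    scale[scale == 0] = 1.0                    # guard all-zero rows
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale            # integer codes + float scales

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
q, s = quantize_rtn(w, bits=4)
w_hat = dequantize(q, s)
# Elementwise reconstruction error is bounded by half a quantization step,
# i.e. scale / 2 for each row.
```

Only the int8 codes and one float scale per row are stored, which is where the memory savings of the quantized-inference libraries above come from.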

Code for ICML 2022 paper "SPDY: Accurate Pruning with Speedup Guarantees"

Python · 20 stars · 4 forks · Updated May 3, 2023

Pytorch distributed backend extension with compression support

C++ · 16 stars · Updated Mar 24, 2025