Starred repositories
Source files to replicate experiments in my RLC 2025 paper.
CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning
Sionna Research Kit: A GPU-Accelerated Research Platform for AI-RAN
Allo Accelerator Design and Programming Framework (PLDI'24)
[ArXiv 2025] A curated list of papers on on-device large language models, focusing on model compression and system optimization techniques from the survey "On-Device Large Language Models: A Survey…
Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Tile primitives for speedy kernels
DFlash: Block Diffusion for Flash Speculative Decoding
Miles is an enterprise-facing reinforcement learning framework for LLM and VLM post-training, forked from and co-evolving with slime.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
The simplest, fastest repository for training/finetuning medium-sized GPTs.
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
verl: Volcano Engine Reinforcement Learning for LLMs
Official inference framework for 1-bit LLMs
This repo releases the detailed benchmark code and results of Sea Labs AI.
ArcticInference: vLLM plugin for high-throughput, low-latency inference
Papers from the computer science community to read and discuss.
AIInfra (AI infrastructure) refers to the full AI system stack, from low-level hardware such as chips up to the software stack that supports training and inference of large AI models.
CrowdPose: Efficient Crowded Scenes Pose Estimation and A New Benchmark, CVPR 2019, Oral