Large Language Model (LLM) Systems Paper List

1,987 102 Updated May 16, 2026

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

C++ 587 206 Updated Apr 25, 2026

LLMServingSim 2.0: A Unified Simulator for Heterogeneous and Disaggregated LLM Serving Infrastructure

Python 278 72 Updated May 11, 2026

SGLang is a fast serving framework for large language models and vision language models.

Python 7 3 Updated Dec 15, 2025

Supercharge Your LLM with the Fastest KV Cache Layer

Python 8,301 1,181 Updated May 19, 2026

A Python package that extends official PyTorch to more easily obtain performance on Intel platforms

Python 2,015 315 Updated Mar 30, 2026

Achieve state of the art inference performance with modern accelerators on Kubernetes

Shell 3,214 483 Updated May 16, 2026

❤️ 1000+ Hand-Crafted Go Examples, Exercises, and Quizzes. 🚀 Learn Go by fixing 1000+ tiny programs.

Go 20,022 2,718 Updated Jun 24, 2025

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

Python 2,179 217 Updated Oct 8, 2024

Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels

Python 6,260 572 Updated May 19, 2026

A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems

Python 259 15 Updated Mar 19, 2026

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs

C++ 729 95 Updated Apr 21, 2026

Unsloth Studio is a web UI for training and running open models like Gemma 4, Qwen3.6, DeepSeek, gpt-oss locally.

Python 64,637 5,711 Updated May 19, 2026

Running large language models on a single GPU for throughput-oriented scenarios.

Python 9,369 590 Updated Oct 28, 2024
Jupyter Notebook 91 11 Updated Oct 17, 2025

Text-audio foundation model from Boson AI

Python 8,074 622 Updated Jan 18, 2026
Python 244 32 Updated Nov 9, 2022

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 28,002 5,999 Updated May 19, 2026

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 10,462 773 Updated Apr 21, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 80,460 16,955 Updated May 19, 2026

Reference implementations of MLPerf® inference benchmarks

Python 1,567 625 Updated May 14, 2026

FlashInfer: Kernel Library for LLM Serving

Python 5,636 981 Updated May 19, 2026

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 1,140 113 Updated Dec 30, 2024

Fast and memory-efficient exact attention

Python 23,839 2,744 Updated May 16, 2026

Rust Learning Resources

2,012 198 Updated Apr 6, 2025

A Datacenter Scale Distributed Inference Serving Framework

Rust 6,811 1,114 Updated May 19, 2026

Fast CUDA matrix multiplication from scratch

Cuda 1,187 186 Updated Sep 2, 2025

GoogleTest - Google Testing and Mocking Framework

C++ 38,622 10,777 Updated May 15, 2026

CUDA Matrix Multiplication Optimization

Cuda 271 26 Updated Jul 19, 2024

LLM inference in C/C++

C++ 111,307 18,415 Updated May 19, 2026