
Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models (ICLR 2024)

Python 14 Updated May 31, 2025

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,941 1,919 Updated Jun 23, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,552 6,672 Updated Jun 23, 2026

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Python 18,567 1,510 Updated May 24, 2026

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 2,106 191 Updated Jun 30, 2025

Benchmarking suite for popular AI APIs

Python 89 15 Updated Feb 6, 2025

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,914 124 Updated Jan 21, 2024

FlashInfer: Kernel Library for LLM Serving

Python 5,841 1,070 Updated Jun 23, 2026

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 7,232 398 Updated Jul 11, 2024

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python 4,562 338 Updated Mar 4, 2026

QLoRA: Efficient Finetuning of Quantized LLMs

Jupyter Notebook 10,937 874 Updated Jun 10, 2024
Python 122 13 Updated Apr 22, 2024

Fast and memory-efficient exact attention

Python 24,221 2,851 Updated Jun 22, 2026

4 bits quantization of LLaMA using GPTQ

Python 3,072 451 Updated Jul 13, 2024

LLM inference in C/C++

C++ 117,770 19,841 Updated Jun 23, 2026

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Python 2,924 222 Updated Sep 30, 2023

High-performance In-browser LLM Inference Engine

TypeScript 18,254 1,312 Updated Jun 9, 2026
Python 176 104 Updated Jun 14, 2026

Temp repo for prototyping relax(relay next), the effort will be upstreamed. We use the wiki pages on this repo to host design docs.

Python 5 Updated Mar 6, 2023

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 2…

C 950 161 Updated Nov 27, 2024
C++ 144 21 Updated Jan 30, 2025

An Open Source Machine Learning Framework for Everyone

C++ 195,844 75,185 Updated Jun 23, 2026

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.

C++ 1,002 167 Updated Sep 19, 2024
Python 394 115 Updated Nov 4, 2022

The Tensor Algebra SuperOptimizer for Deep Learning

C++ 743 93 Updated Jan 26, 2023

Development repository for the Triton language and compiler

MLIR 19,506 2,957 Updated Jun 23, 2026
C++ 9 3 Updated Dec 18, 2021

Open single and half precision gemm implementations

C 397 87 Updated Apr 2, 2023