

Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models (ICLR 2024)

Python 14 Updated May 31, 2025

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,935 1,918 Updated Jun 21, 2026

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,514 6,647 Updated Jun 22, 2026

tiktoken is a fast BPE tokeniser for use with OpenAI's models.

Python 18,554 1,511 Updated May 24, 2026

MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.

Python 2,106 191 Updated Jun 30, 2025

Benchmarking suite for popular AI APIs

Python 89 15 Updated Feb 6, 2025

S-LoRA: Serving Thousands of Concurrent LoRA Adapters

Python 1,913 124 Updated Jan 21, 2024

FlashInfer: Kernel Library for LLM Serving

Python 5,835 1,066 Updated Jun 22, 2026

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 7,232 398 Updated Jul 11, 2024

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python 4,560 338 Updated Mar 4, 2026

QLoRA: Efficient Finetuning of Quantized LLMs

Jupyter Notebook 10,936 874 Updated Jun 10, 2024
Python 122 13 Updated Apr 22, 2024

Fast and memory-efficient exact attention

Python 24,208 2,850 Updated Jun 20, 2026

4 bits quantization of LLaMA using GPTQ

Python 3,072 451 Updated Jul 13, 2024

LLM inference in C/C++

C++ 117,617 19,803 Updated Jun 22, 2026

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.

Python 2,924 222 Updated Sep 30, 2023

High-performance In-browser LLM Inference Engine

TypeScript 18,250 1,312 Updated Jun 9, 2026
Python 176 104 Updated Jun 14, 2026

Temp repo for prototyping relax(relay next), the effort will be upstreamed. We use the wiki pages on this repo to host design docs.

Python 5 Updated Mar 6, 2023

[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 2…

C 950 161 Updated Nov 27, 2024
C++ 144 21 Updated Jan 30, 2025

An Open Source Machine Learning Framework for Everyone

C++ 195,816 75,196 Updated Jun 22, 2026

A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description.

C++ 1,002 167 Updated Sep 19, 2024
Python 394 115 Updated Nov 4, 2022

The Tensor Algebra SuperOptimizer for Deep Learning

C++ 743 93 Updated Jan 26, 2023

Development repository for the Triton language and compiler

MLIR 19,494 2,952 Updated Jun 21, 2026
C++ 9 3 Updated Dec 18, 2021

Open single and half precision gemm implementations

C 397 87 Updated Apr 2, 2023