View ymwangg's full-sized avatar

A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.

Python 1 Updated Jun 11, 2026

NKIPy: Rapid Prototyping on Trainium

Python 28 9 Updated Jun 19, 2026
Python 64 7 Updated Jun 2, 2026

A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.

Python 890 155 Updated Jun 22, 2026
Python 17 6 Updated Jun 19, 2026

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production

Rust 10,835 1,125 Updated Jun 22, 2026

Every front-end GUI client for ChatGPT, Claude, and other LLMs

3,985 274 Updated Jan 22, 2026

📚 A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc. 🎉

Python 5,351 410 Updated Apr 20, 2026

PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices

Python 40 28 Updated Jan 31, 2024

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,939 1,919 Updated Jun 21, 2026

A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative decoding, etc. It compresses deep learning models for downs…

Python 2,964 453 Updated Jun 22, 2026

This repository contains integer operators on GPUs for PyTorch.

Python 235 55 Updated Sep 29, 2023

Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.

Python 10,583 1,071 Updated Jul 1, 2024

This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & V…

1,293 72 Updated Mar 9, 2025

The official Meta Llama 3 GitHub site

Python 29,291 3,531 Updated Jan 26, 2025

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.

Python 1,092 88 Updated Sep 4, 2024

A framework for few-shot evaluation of language models.

Python 13,024 3,353 Updated Jun 22, 2026

📰 Must-read papers and blogs on Speculative Decoding ⚡️

1,259 80 Updated Jun 2, 2026

FlashInfer: Kernel Library for LLM Serving

Python 5,838 1,068 Updated Jun 22, 2026

Measuring Massive Multitask Language Understanding | ICLR 2021

Python 1,589 117 Updated May 28, 2023

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Python 2,342 303 Updated May 11, 2025

SGLang is a high-performance serving framework for large language models and multimodal models.

Python 29,529 6,657 Updated Jun 22, 2026

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Python 8,994 624 Updated May 3, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 3,568 316 Updated Jul 17, 2025

Numbers every LLM developer should know

4,312 140 Updated Jan 16, 2024

The simplest, fastest repository for training/finetuning medium-sized GPTs.

Python 60,007 10,338 Updated Nov 12, 2025

Fast and memory-efficient exact attention

Python 24,210 2,851 Updated Jun 20, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 83,557 18,320 Updated Jun 22, 2026

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset

7,528 405 Updated Jul 16, 2023

Python pdb for multiple processes

Python 82 9 Updated May 24, 2025