Stars
rohan-gopalam / helion_nki
Forked from pytorch/helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
Every front-end GUI client for ChatGPT, Claude, and other LLMs
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices
CUDA Templates and Python DSLs for High-Performance Linear Algebra
A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative decoding, etc. It compresses deep learning models for downs…
This repository contains integer operators on GPUs for PyTorch.
Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization.
This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation and Distillation Algorithms, and explore the Skill & V…
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
A framework for few-shot evaluation of language models.
📰 Must-read papers and blogs on Speculative Decoding ⚡️
FlashInfer: Kernel Library for LLM Serving
Measuring Massive Multitask Language Understanding | ICLR 2021
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
SGLang is a high-performance serving framework for large language models and multimodal models.
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
The simplest, fastest repository for training/finetuning medium-sized GPTs.
Fast and memory-efficient exact attention
A high-throughput and memory-efficient inference and serving engine for LLMs
OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset