Stars
Accurate Retraining-free Pruning for Pretrained Encoder-based Language Models (ICLR 2024)
CUDA Templates and Python DSLs for High-Performance Linear Algebra
SGLang is a high-performance serving framework for large language models and multimodal models.
tiktoken is a fast BPE tokeniser for use with OpenAI's models.
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
FlashInfer: Kernel Library for LLM Serving
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
A fast inference library for running LLMs locally on modern consumer-class GPUs
QLoRA: Efficient Finetuning of Quantized LLMs
Fast and memory-efficient exact attention
4 bits quantization of LLaMA using GPTQ
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
High-performance In-browser LLM Inference Engine
psrivas2 / relax
Forked from tlc-pack/relax
Temp repo for prototyping relax (relay next); the effort will be upstreamed. We use the wiki pages on this repo to host design docs.
[NeurIPS 2020] MCUNet: Tiny Deep Learning on IoT Devices; [NeurIPS 2021] MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning; [NeurIPS 2022] MCUNetV3: On-Device Training Under 2…
An Open Source Machine Learning Framework for Everyone
A flexible and efficient deep neural network (DNN) compiler that generates a high-performance executable from a DNN model description.
The Tensor Algebra SuperOptimizer for Deep Learning
Development repository for the Triton language and compiler
Open single and half precision gemm implementations