Stars
A framework for efficient model inference with omni-modality models
A modern web interface for managing and interacting with vLLM servers (www.github.com/vllm-project/vllm). Supports both GPU and CPU modes, with special optimizations for macOS Apple Silicon and ent…
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
No fortress, purely open ground. OpenManus is Coming.
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
Submission for the SG Innovation Challenge
Accessible large language models via k-bit quantization for PyTorch.
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Efficient LLM Inference over Long Sequences
Master programming by recreating your favorite technologies from scratch.
SGLang is a fast serving framework for large language models and vision language models.
Open-source observability for your GenAI or LLM application, based on OpenTelemetry
LLM Serving Performance Evaluation Harness
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.
Efficient and easy multi-instance LLM serving
Artifact of OSDI '24 paper, "Llumnix: Dynamic Scheduling for Large Language Model Serving"
HabanaAI / vllm-fork
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs
Vision-Augmented Retrieval and Generation (VARAG) - Vision first RAG Engine
A collection of best resources to learn System Design, Software architecture, and prepare for System Design Interviews
[NeurIPS 2024] Official Repository of The Mamba in the Llama: Distilling and Accelerating Hybrid Models
A throughput-oriented high-performance serving framework for LLMs
Learn System Design concepts and prepare for interviews using free resources.