                                  Welcome to vLLM!
                                  Easy, fast, and cheap LLM serving for everyone
             vLLM is a fast and easy-to-use library for LLM inference and serving.
             vLLM is fast with:
                State-of-the-art serving throughput
                Efficient management of attention key and value memory with PagedAttention
                Continuous batching of incoming requests
                Fast model execution with CUDA/HIP graph
                Quantization: GPTQ, AWQ, INT4, INT8, and FP8
                Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
                 Speculative decoding
                Chunked prefill
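              As a rough illustration of how these performance features are switched on, the
              sketch below constructs an offline engine with AWQ quantization and chunked prefill
              enabled. It is a minimal sketch, not an official recipe: the model id is a
              placeholder for any AWQ-quantized checkpoint, and the flag values are illustrative
              (see Engine Arguments for the full list).

                  # Minimal sketch; assumes an AWQ-quantized checkpoint and a single GPU.
                  from vllm import LLM

                  llm = LLM(
                      model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ checkpoint
                      quantization="awq",                    # load AWQ-quantized weights
                      enable_chunked_prefill=True,           # split long prefills into chunks
                      gpu_memory_utilization=0.90,           # fraction of GPU memory vLLM may use
                  )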
             vLLM is flexible and easy to use with:
                Seamless integration with popular HuggingFace models
                 High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
                Tensor parallelism and pipeline parallelism support for distributed inference
                 Streaming outputs
                 OpenAI-compatible API server
                 Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
                 Prefix caching support
                 Multi-LoRA support
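              To make the ease of use concrete, the minimal sketch below runs offline inference
              with a HuggingFace model; the model id, prompt, and sampling values are placeholders.

                  # Minimal offline-inference sketch; model id and prompt are placeholders.
                  from vllm import LLM, SamplingParams

                  llm = LLM(model="facebook/opt-125m")       # any supported HuggingFace model id
                  params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

                  outputs = llm.generate(["The capital of France is"], params)
                  for out in outputs:
                      print(out.outputs[0].text)             # generated continuation

              For online serving, the same model can be exposed through the OpenAI-compatible API
              server (for example, vllm serve facebook/opt-125m) and queried with any OpenAI
              client; see the OpenAI Compatible Server section below.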
             For more information, check out the following:
                 vLLM announcing blog post (intro to PagedAttention)
                 vLLM paper (SOSP 2023)
                 How continuous batching enables 23x throughput in LLM inference while reducing p50
                 latency by Cade Daniel et al.
                 vLLM Meetups.
             Documentation
             Getting Started
             Installation
             Installation with ROCm
             Installation with OpenVINO
             Installation with CPU
             Installation with Intel® Gaudi® AI Accelerators
             Installation for ARM CPUs
             Installation with Neuron
             Installation with TPU
             Installation with XPU
              Quickstart
             Debugging Tips
             Examples
             Serving
              OpenAI Compatible Server
             Deploying with Docker
             Deploying with Kubernetes
             Deploying with Nginx Loadbalancer
             Distributed Inference and Serving
             Production Metrics
             Environment Variables
             Usage Stats Collection
             Integrations
             Loading Models with CoreWeave’s Tensorizer
             Compatibility Matrix
             Frequently Asked Questions
             Models
             Supported Models
             Model Support Policy
             Adding a New Model
             Enabling Multimodal Inputs
             Engine Arguments
             Using LoRA adapters
             Using VLMs
             Structured Outputs
             Speculative decoding in vLLM
             Performance and Tuning
             Quantization
             Supported Hardware for Quantization Kernels
             AutoAWQ
             BitsAndBytes
              GGUF
             INT8 W8A8
             FP8 W8A8
             FP8 E5M2 KV Cache
             FP8 E4M3 KV Cache
              Automatic Prefix Caching
             Introduction
             Implementation
             Performance
             Benchmark Suites
             Community
             vLLM Meetups
             Sponsors
             API Documentation
             Sampling Parameters
                    SamplingParams
             Pooling Parameters
                    PoolingParams
             Offline Inference
                LLM Class
                LLM Inputs
             vLLM Engine
                LLMEngine
                AsyncLLMEngine
             Design
             Architecture Overview
                Entrypoints
                LLM Engine
                Worker
                Model Runner
                 Model
                Class Hierarchy
             Integration with HuggingFace
             vLLM’s Plugin System
                How Plugins Work in vLLM
                How vLLM Discovers Plugins
                 What Can Plugins Do?
                Guidelines for Writing Plugins
                Compatibility Guarantee
             Input Processing
                Guides
                Module Contents
             vLLM Paged Attention
                Inputs
                Concepts
                Query
                Key
                QK
                Softmax
                Value
                LV
                Output
             Multi-Modality
                Guides
                Module Contents
             For Developers
             Contributing to vLLM
                License
                Developing
                Testing
             Contribution Guidelines
                Issues
                Pull Requests & Code Reviews
                 Thank You
             Profiling vLLM
                Example commands and usage:
             Dockerfile
              Indices and tables
                 Index
                 Module Index