                                  Welcome to vLLM!
                                  Easy, fast, and cheap LLM serving for everyone
             vLLM is a fast and easy-to-use library for LLM inference and serving.
             vLLM is fast with:
                State-of-the-art serving throughput
                Efficient management of attention key and value memory with PagedAttention
                Continuous batching of incoming requests
                Fast model execution with CUDA/HIP graph
                Quantization: GPTQ, AWQ, INT4, INT8, and FP8
                Optimized CUDA kernels, including integration with FlashAttention and FlashInfer.
                 Speculative decoding
                Chunked prefill
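              As a rough illustration of how these performance features are switched on, the
              sketch below constructs an offline engine with AWQ quantization and chunked prefill
              enabled. It is a minimal sketch, not an official recipe: the model id is a
              placeholder for any AWQ-quantized checkpoint, and the flag values are illustrative
              (see Engine Arguments for the full list).

                  # Minimal sketch; assumes an AWQ-quantized checkpoint and a single GPU.
                  from vllm import LLM

                  llm = LLM(
                      model="TheBloke/Llama-2-7B-Chat-AWQ",  # placeholder AWQ checkpoint
                      quantization="awq",                    # load AWQ-quantized weights
                      enable_chunked_prefill=True,           # split long prefills into chunks
                      gpu_memory_utilization=0.90,           # fraction of GPU memory vLLM may use
                  )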
             vLLM is flexible and easy to use with:
                Seamless integration with popular HuggingFace models
                 High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
                Tensor parallelism and pipeline parallelism support for distributed inference
                 Streaming outputs
                 OpenAI-compatible API server
                 Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs, Gaudi® accelerators and GPUs, PowerPC CPUs, TPUs, and AWS Trainium and Inferentia accelerators
                 Prefix caching support
                 Multi-LoRA support
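              To make the ease of use concrete, the minimal sketch below runs offline inference
              with a HuggingFace model; the model id, prompt, and sampling values are placeholders.

                  # Minimal offline-inference sketch; model id and prompt are placeholders.
                  from vllm import LLM, SamplingParams

                  llm = LLM(model="facebook/opt-125m")       # any supported HuggingFace model id
                  params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

                  outputs = llm.generate(["The capital of France is"], params)
                  for out in outputs:
                      print(out.outputs[0].text)             # generated continuation

              For online serving, the same model can be exposed through the OpenAI-compatible API
              server (for example, vllm serve facebook/opt-125m) and queried with any OpenAI
              client; see the OpenAI Compatible Server section below.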
             For more information, check out the following:
                 vLLM announcing blog post (intro to PagedAttention)
                 vLLM paper (SOSP 2023)
                 How continuous batching enables 23x throughput in LLM inference while reducing p50
                 latency by Cade Daniel et al.
                 vLLM Meetups.
             Documentation
             Getting Started
             Installation
             Installation with ROCm
             Installation with OpenVINO
             Installation with CPU
             Installation with Intel® Gaudi® AI Accelerators
             Installation for ARM CPUs
             Installation with Neuron
             Installation with TPU
             Installation with XPU
              Quickstart
             Debugging Tips
             Examples
             Serving
              OpenAI Compatible Server
             Deploying with Docker
             Deploying with Kubernetes
             Deploying with Nginx Loadbalancer
             Distributed Inference and Serving
             Production Metrics
             Environment Variables
             Usage Stats Collection
             Integrations
             Loading Models with CoreWeave’s Tensorizer
             Compatibility Matrix
             Frequently Asked Questions
             Models
             Supported Models
             Model Support Policy
             Adding a New Model
             Enabling Multimodal Inputs
             Engine Arguments
             Using LoRA adapters
             Using VLMs
             Structured Outputs
             Speculative decoding in vLLM
             Performance and Tuning
             Quantization
             Supported Hardware for Quantization Kernels
             AutoAWQ
             BitsAndBytes
              GGUF
             INT8 W8A8
             FP8 W8A8
             FP8 E5M2 KV Cache
             FP8 E4M3 KV Cache
              Automatic Prefix Caching
             Introduction
             Implementation
             Performance
             Benchmark Suites
             Community
             vLLM Meetups
             Sponsors
             API Documentation
             Sampling Parameters
                    SamplingParams
             Pooling Parameters
                    PoolingParams
             Offline Inference
                LLM Class
                LLM Inputs
             vLLM Engine
                LLMEngine
                AsyncLLMEngine
             Design
             Architecture Overview
                Entrypoints
                LLM Engine
                Worker
                Model Runner
                 Model
                Class Hierarchy
             Integration with HuggingFace
             vLLM’s Plugin System
                How Plugins Work in vLLM
                How vLLM Discovers Plugins
                 What Can Plugins Do?
                Guidelines for Writing Plugins
                Compatibility Guarantee
             Input Processing
                Guides
                Module Contents
             vLLM Paged Attention
                Inputs
                Concepts
                Query
                Key
                QK
                Softmax
                Value
                LV
                Output
             Multi-Modality
                Guides
                Module Contents
             For Developers
             Contributing to vLLM
                License
                Developing
                Testing
             Contribution Guidelines
                Issues
                Pull Requests & Code Reviews
                 Thank You
             Profiling vLLM
                Example commands and usage:
             Dockerfile
              Indices and tables
                 Index
                 Module Index