vLLM Documentation
by vLLM (originally UC Berkeley Sky Computing Lab)
The official docs for the high-throughput inference engine that serves LLMs in production.
Overview
vLLM is a fast and easy-to-use library for LLM inference and serving, originally created at UC Berkeley's Sky Computing Lab and now a major open-source project with contributors across academia and industry. The official documentation is organized for three audiences: users who want to run models (a quickstart path), developers building applications on vLLM (user guides), and contributors working on the engine itself (developer guides). It covers getting started and running open-source models, building applications, and distributed inference using tensor, pipeline, and data parallelism. Key performance features are documented in depth — PagedAttention for memory-efficient KV-cache management, continuous batching for high throughput, and speculative decoding — along with a drop-in OpenAI-compatible API server for instant integration into existing stacks. The docs note support for 200+ model architectures, multiple hardware platforms (NVIDIA, AMD, CPUs, TPUs and more), multi-LoRA serving, and structured/guided output generation, making it a go-to reference for taking LLMs from a notebook to a production serving endpoint.
At a Glance
- Topic
- Frameworks
- Level
- Intermediate
- Format
- Documentation
- Cost
- Free
- Duration
- Self-paced reference + guides
- Provider
- vLLM (originally UC Berkeley Sky Computing Lab)
- Hands-on
- Yes — code/exercises
- Certificate
- None
What You’ll Learn
- ✓How to serve open-source LLMs with vLLM and expose an OpenAI-compatible API endpoint
- ✓How PagedAttention and continuous batching deliver high-throughput, memory-efficient inference
- ✓How to run distributed inference with tensor, pipeline, and data parallelism across GPUs
- ✓How to use advanced features like speculative decoding, multi-LoRA serving, and structured output
- ✓How to tune serving for throughput and latency across supported hardware platforms
Highlights
- •Industry-standard open-source inference engine, born at UC Berkeley's Sky Computing Lab
- •Drop-in OpenAI-compatible API server for instant integration
- •PagedAttention + continuous batching for state-of-the-art throughput and memory efficiency
- •Supports 200+ model architectures and many hardware backends (NVIDIA, AMD, CPU, TPU)
Who It’s For
Best For
- ✓AI/ML engineers deploying and serving LLMs in production
- ✓Platform/infrastructure teams building inference endpoints
- ✓Developers who need high-throughput, low-cost local or self-hosted model serving
Prerequisites
- •Python and command-line proficiency
- •Basic understanding of LLM inference, GPUs, and serving/APIs
FAQ
What is vLLM Documentation?
The official documentation for vLLM, a fast, memory-efficient library for LLM inference and serving. Aimed at AI engineers who need to serve open-source models at scale with an OpenAI-compatible API, high throughput, and efficient GPU memory use.
Is vLLM Documentation free?
vLLM Documentation is free to access.
What level is vLLM Documentation for?
vLLM Documentation is aimed at a intermediate audience. Recommended background: Python and command-line proficiency, Basic understanding of LLM inference, GPUs, and serving/APIs.
How long does vLLM Documentation take?
Expect roughly Self-paced reference + guides. Most learners work through it at their own pace.
What will I learn from vLLM Documentation?
You'll learn: How to serve open-source LLMs with vLLM and expose an OpenAI-compatible API endpoint; How PagedAttention and continuous batching deliver high-throughput, memory-efficient inference; How to run distributed inference with tensor, pipeline, and data parallelism across GPUs; How to use advanced features like speculative decoding, multi-LoRA serving, and structured output; How to tune serving for throughput and latency across supported hardware platforms.