vLLM: Production LLM Serving from Zero to Scale
You've downloaded a large language model. You've got it running. But you notice something uncomfortable: it's slow, it can only handle one request at a time, and your GPU is mysteriously underutilized. The moment two people try to use your model at the same time, one of them waits — and waits.
This is the LLM serving problem, and vLLM is the most widely adopted open-source solution to it.
Part 1: The Problem vLLM Solves
Why Naïve LLM Serving Is Slow
When you call model.generate(...) directly, a few things happen behind the scenes that make it unsuitable for serving multiple users:
Problem 1: Memory Waste from Static KV Cache Allocation
Every time a model processes input tokens, it computes Key-Value (KV) tensors for the attention mechanism. These tensors are large and need to be stored in GPU memory throughout the generation.
A request with 512 token context on a 7B model:
KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × dtype_bytes
= 2 × 32 × 32 × 128 × 512 × 2 bytes
= ~268 MB per request
If you pre-allocate for the maximum sequence length (4096):
268 MB × (4096/512) = 2.1 GB — even if the actual output is 50 tokens!
Naïve frameworks allocate the maximum possible memory upfront. Most of it sits empty and wasted.
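The arithmetic above can be reproduced in a few lines. This is a minimal sketch: `kv_cache_bytes` is a hypothetical helper name, and the model shape (32 layers, 32 heads, head dimension 128, fp16) is the 7B configuration assumed in the formula.

```python
def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence: keys and values (factor 2) for every layer."""
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes


used = kv_cache_bytes(32, 32, 128, seq_len=512)       # what a 512-token request needs
reserved = kv_cache_bytes(32, 32, 128, seq_len=4096)  # what a max-length reservation costs

print(f"512-token request:  {used / 1e6:.0f} MB")      # 268 MB
print(f"4096-token reserve: {reserved / 1e9:.2f} GB")  # 2.15 GB
```

The cache grows linearly with sequence length, which is why reserving for the maximum length costs eight times more than the 512-token request actually needs.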
Problem 2: No Concurrent Request Handling
Standard inference is sequential: process request 1 completely, then request 2, then request 3.
Naïve server:
Time →
GPU: [===Req1===][===Req2===][===Req3===]
40s 40s 40s Total: 120s
User 1 OK User 2 waits 40s User 3 waits 80s
Decoding for a single request uses only a fraction of the GPU's compute, so the hardware sits partly idle while later requests queue behind it.
Problem 3: Memory Fragmentation
Even if you try to batch requests, different requests have different lengths. Memory reserved for one long request can't be used by three short requests.
GPU Memory (24 GB):
[====Request A (2048 max)====][====Request B (2048 max)====][ 12 GB free ]
^^ But Request A only uses 200 of its 2048 reserved tokens: about 90% of its reservation is wasted ^^
What vLLM Achieves
vLLM addresses all three problems and delivers:
| Metric | Naïve serving | vLLM |
|---|---|---|
| Throughput | 1–3 req/s (7B) | 20–50+ req/s (7B) |
| GPU Memory Efficiency | 40–60% utilized | 90–95% utilized |
| Concurrent users | 1 | Hundreds |
| Latency (first token) | Same | Same or faster |
| Latency (full response) | Baseline | Lower under load, because requests stop queuing (the 10–24× gain is in throughput, not per-request latency) |
Part 2: How vLLM Works: The Core Innovations
Innovation 1: PagedAttention
PagedAttention is vLLM's headline invention, introduced in a 2023 UC Berkeley paper. It borrows an idea from operating systems: virtual memory paging.
In a traditional OS, physical RAM is divided into fixed-size pages. A program's memory isn't physically contiguous: it's scattered across pages, and a page table maps virtual addresses to physical pages. This eliminates external fragmentation, because any free page can satisfy any allocation.
PagedAttention applies the same idea to the KV cache:
Traditional KV Cache (pre-vLLM):
GPU Memory:
┌──────────────────────────────────────────────────────┐
│ Request A: [tok0][tok1][tok2][EMPTY][EMPTY][EMPTY] │ ← 50% wasted
│ Request B: [tok0][tok1][EMPTY][EMPTY][EMPTY][EMPTY] │ ← 67% wasted
│ Request C: NO ROOM — must wait │
└──────────────────────────────────────────────────────┘
PagedAttention KV Cache:
GPU Memory divided into fixed-size BLOCKS (e.g., 16 tokens each):
┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
│ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ B8 │ B9 │
└────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘
Request A uses blocks: B0, B2, B7 (only what it needs)
Request B uses blocks: B1, B4 (only what it needs)
Request C uses blocks: B3, B5, B8, B9 (fits in the gaps!)
Block table maps logical blocks to physical blocks:
Request A → [B0 at pos 0, B2 at pos 1, B7 at pos 2]
Key benefits:
- Near-zero internal fragmentation: only the last, partially filled block of each request can hold unused slots (< 4% wasted)
- Multiple requests can share blocks for identical prefixes (prompt caching)
- New requests can start immediately as blocks become available
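The block-table idea can be sketched as a toy allocator. This illustrates the bookkeeping only, not vLLM's implementation: `BlockAllocator` and its methods are hypothetical names, and no tensors are stored.

```python
class BlockAllocator:
    """Toy paged KV-cache allocator: a free list plus one block table per request."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(num_blocks))      # physical block ids not in use
        self.tables: dict[str, list[int]] = {}   # request id -> physical blocks, in logical order
        self.lengths: dict[str, int] = {}        # request id -> tokens stored so far

    def append_token(self, req: str) -> None:
        """Store one more token, taking a new physical block only when the last one is full."""
        n = self.lengths.get(req, 0)
        if n % self.block_size == 0:             # first token, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req, []).append(self.free.pop(0))
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return all blocks of a finished request to the free list."""
        self.free.extend(self.tables.pop(req, []))
        self.lengths.pop(req, None)


alloc = BlockAllocator(num_blocks=10)
for _ in range(20):
    alloc.append_token("A")   # 20 tokens need 2 blocks of 16
for _ in range(5):
    alloc.append_token("B")   # 5 tokens need 1 block
print(alloc.tables)           # {'A': [0, 1], 'B': [2]}

alloc.release("A")            # A finishes; its blocks are reusable immediately
print(sorted(alloc.free))     # [0, 1, 3, 4, 5, 6, 7, 8, 9]
```

Because a request holds only the blocks it has filled, the waste per request is bounded by one partially filled block, regardless of the maximum sequence length.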
Innovation 2: Continuous Batching
Naïve batching waits for all requests in a batch to finish before starting new ones. This is wasteful because requests finish at different times:
Static batching:
Batch 1: [Req A ===30tok===][Req B ======50tok======]
Batch 2: starts only after BOTH A and B finish (50 tok wait)
A's slot sits idle for the last 20 tokens while B finishes!
Continuous batching (vLLM):
Iteration 0: [A][B][C]
Iteration 1: [A][B][C]
...
Iteration 30: [A finishes][B][C] → immediately swap in [D]!
Iteration 31: [D][B][C] ← no gap
...
GPU is always doing useful work
This is sometimes called "iteration-level scheduling" — the scheduler makes decisions at every single token generation step, not at the batch level.
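A small simulation makes the difference concrete. It is a simplified sketch under stated assumptions: every request needs one decoding iteration per output token, prefill cost and memory limits are ignored, and both function names are hypothetical.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request's slot is refilled from the queue at the next iteration."""
    queue = list(lengths)
    running: list[int] = []                           # remaining tokens per running request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:    # fill any free slots
            running.append(queue.pop(0))
        steps += 1                                    # one token for every running request
        running = [r - 1 for r in running if r > 1]   # drop requests that just finished
    return steps


output_lengths = [30, 50, 40, 10, 20, 50]   # tokens each request will generate
print(static_batching_steps(output_lengths, batch_size=3))      # 100
print(continuous_batching_steps(output_lengths, batch_size=3))  # 90
```

The saving grows with the spread of output lengths: the more the lengths within a batch differ, the more slot-time static batching leaves empty.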
Innovation 3: Prefix Caching (Prompt Caching)
When multiple requests share the same system prompt or context, vLLM can cache the KV blocks for the shared prefix and reuse them across requests — no recomputation needed.
System prompt: "You are a helpful customer service agent for AcmeCorp..."
[=============== 500 tokens ===============]
Request 1: [shared prefix 500 tok] + "What are your hours?"
Request 2: [shared prefix 500 tok] + "How do I return a product?"
Request 3: [shared prefix 500 tok] + "Where is my order?"
Without prefix caching: compute 500-token prefix 3 times
With prefix caching: compute 500-token prefix ONCE, reuse KV cache
Speedup: roughly 35% for typical RAG/chat applications; the gain grows with the share of each prompt that is common
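One way to picture block-level prefix reuse is a chained hash per block, so that two prompts share a cache entry only when everything up to and including that block is identical. The sketch below is a toy model of that idea, not vLLM's code: the block size of 4, the `cache` set, and the function names are assumptions for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block (tiny, for illustration)


def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with all blocks before it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK_SIZE      # ignore the trailing partial block
    for i in range(0, full, BLOCK_SIZE):
        block = ",".join(map(str, tokens[i:i + BLOCK_SIZE]))
        prev = hashlib.sha256(f"{prev}|{block}".encode()).hexdigest()
        hashes.append(prev)
    return hashes


cache: set[str] = set()   # hashes of blocks whose KV tensors already exist


def tokens_to_compute(tokens: list[int]) -> int:
    """Return how many prompt tokens still need a forward pass, then cache this prompt's blocks."""
    hashes = block_hashes(tokens)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return len(tokens) - hits * BLOCK_SIZE


system_prompt = list(range(100, 112))                 # 12 shared tokens = 3 full blocks
print(tokens_to_compute(system_prompt + [1, 2, 3]))   # 15: nothing cached yet
print(tokens_to_compute(system_prompt + [7, 8, 9]))   # 3: the shared prefix is reused
```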
Innovation 4: Speculative Decoding
Speculative decoding uses a small "draft" model to generate multiple candidate tokens quickly, then verifies them with the large model in parallel:
Normal generation (LLaMA 70B):
Generate token 1 → Generate token 2 → Generate token 3 ...
Each step: full 70B forward pass
Speculative decoding:
Draft model (1B) generates 5 candidate tokens: ["The", "cat", "sat", "on", "the"]
Large model (70B) verifies all 5 in ONE forward pass
Accept tokens up to first mismatch, reject rest, continue
Result: 2–4× more tokens per second for easy/predictable text
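The accept/reject step can be sketched for the greedy case. This is a simplification: real implementations compare probability distributions with rejection sampling rather than exact token matches, and `accept_draft` is a hypothetical name.

```python
def accept_draft(draft: list[str], target: list[str]) -> list[str]:
    """Keep draft tokens while they match the large model's choice;
    at the first mismatch, take the large model's token and stop."""
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            accepted.append(t)   # the large model's token replaces the rejected one
            break
        accepted.append(d)
    return accepted


draft = ["The", "cat", "sat", "on", "the"]
target = ["The", "cat", "sat", "in", "a"]   # large model's choice at each position
print(accept_draft(draft, target))          # ['The', 'cat', 'sat', 'in']
```

Even in the worst case (the first draft token is rejected) the large model still contributes one token per forward pass, so output quality matches normal decoding; only the speedup varies.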
Part 3: Installation and First Steps
Prerequisites
| Requirement | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA (CUDA 11.8+) | NVIDIA (CUDA 12.x) |
| VRAM | 8 GB (7B model only with 4-bit quantization; fp16 weights alone need ~14 GB) | 24 GB (7B comfortable), 80 GB (70B quantized) |
| RAM | 32 GB | 64 GB+ |
| Python | 3.9 | 3.11+ |
| OS | Linux | Linux (Ubuntu 22.04) |
| NVIDIA driver | Supports CUDA 11.8 | Supports CUDA 12.4+ |
VRAM Requirements
vLLM loads the full model weights into GPU VRAM. A 7B parameter model at float16 needs roughly 14 GB VRAM just for weights, plus KV cache overhead. Plan for:
- 7B model → 16–24 GB GPU (RTX 4090, A10G, L4)
- 13B model → 28–40 GB GPU (A100 40GB)
- 70B model → 2× A100 80GB (fp16 weights are ~140 GB), or 4× A10G with 4-bit quantization
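A quick way to sanity-check these sizes is to multiply parameter count by bytes per parameter. The helper below is a rough sketch (`weight_gb` is a hypothetical name); it covers weights only and ignores KV cache, activations, and runtime overhead.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


for size in (7, 13, 70):
    print(f"{size}B: fp16 ~ {weight_gb(size, 16):.0f} GB, "
          f"4-bit ~ {weight_gb(size, 4):.1f} GB")
```

For the three sizes above this gives 14, 26, and 140 GB at fp16, which is why the GPU recommendations leave headroom beyond the raw weight size.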
Installation
# Install vLLM from PyPI
pip install vllm
Your First Inference
# hello_vllm.py
from vllm import LLM, SamplingParams

# Load model: this downloads from HuggingFace on first run
# (set HF_TOKEN env var for gated models like LLaMA)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Sampling parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate: notice we pass a LIST of prompts (batching!)
prompts = [
    "What is PagedAttention in vLLM?",
    "Write a Python function to check if a number is prime.",
    "Explain the concept of entropy in information theory.",
]
outputs = llm.generate(prompts, params)

for output in outputs:
    prompt = output.prompt
    response = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Response: {response}")
    print("─" * 60)
Run it:
python hello_vllm.py
You'll see vLLM process all three prompts in a single batched pass — far more efficient than calling generate three times separately.
Part 4: The OpenAI-Compatible Server
This is vLLM's killer feature for most teams. By exposing an OpenAI-compatible REST API, you can point any existing tool or SDK at your vLLM server and it just works — no code changes needed.
Starting the Server
# Basic startup
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000

# With common options (comments cannot follow a trailing backslash, so they are listed here):
#   --max-model-len           context window
#   --tensor-parallel-size    number of GPUs to shard the model across
#   --gpu-memory-utilization  fraction of GPU VRAM vLLM may use
#   --enable-prefix-caching   enable prompt caching
#   --served-model-name       custom model name for the API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --served-model-name "my-llama"
Calling from Python
from openai import OpenAI

# Point to your vLLM server instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require a key by default
)

# Chat completions: identical to calling the OpenAI API
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of Thailand?"},
    ],
    temperature=0.7,
    max_tokens=200,
)
print(response.choices[0].message.content)
print(f"Usage: {response.usage}")

Streaming:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream the response token by token
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,  # ← enable streaming
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end

Concurrent requests with asyncio and httpx:
import asyncio
import httpx

async def chat(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 300,
            },
            timeout=60.0,
        )
        return response.json()["choices"][0]["message"]["content"]

async def main():
    # Run 5 requests concurrently; vLLM batches them on the server side
    results = await asyncio.gather(*[
        chat(f"Question {i}: What is {topic}?")
        for i, topic in enumerate(["Python", "Kubernetes", "vLLM", "LLM", "DevOps"])
    ])
    for r in results:
        print(r[:100], "...\n")

asyncio.run(main())
Checking Server Health and Models
# List available models
curl http://localhost:8000/v1/models | python -m json.tool
# Health check
curl http://localhost:8000/health
# Server metrics (Prometheus format)
curl http://localhost:8000/metrics
Part 5: Key Configuration Options Explained
Every configuration flag affects either quality, speed, or memory. Understanding the trade-offs helps you tune for your workload.
GPU Memory Utilization
Controls what fraction of each GPU's VRAM vLLM may use in total: model weights, activations, and the KV cache. Whatever remains of that budget after the weights are loaded becomes KV cache.
GPU VRAM = 24 GB
Model weights (7B fp16) ≈ 14 GB
With --gpu-memory-utilization 0.90:
Reserved = 24 × 0.90 = 21.6 GB
KV cache = 21.6 - 14 = 7.6 GB → supports ~28 concurrent 512-token requests at the ~268 MB each computed in Part 1
With --gpu-memory-utilization 0.70:
Reserved = 24 × 0.70 = 16.8 GB
KV cache = 16.8 - 14 = 2.8 GB → supports ~10 such requests
(Models that use grouped-query attention store fewer KV heads and fit several times more.)
If you see OOM errors → lower this value
If you want more concurrent users → raise this value (carefully)
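The budget arithmetic is easy to script when comparing settings. This is a rough sketch with hypothetical helper names; it ignores activation memory and runtime overhead, and it uses the ~268 MB per 512-token request figure from Part 1, which is far higher than what a grouped-query-attention model needs.

```python
def kv_budget_gb(vram_gb: float, utilization: float, weights_gb: float) -> float:
    """Memory left for KV cache once the weights are loaded."""
    return vram_gb * utilization - weights_gb


def max_concurrent(budget_gb: float, mb_per_request: float) -> int:
    """How many requests of a given KV footprint fit in the budget."""
    return int(budget_gb * 1000 // mb_per_request)


for util in (0.90, 0.70):
    budget = kv_budget_gb(vram_gb=24, utilization=util, weights_gb=14)
    print(f"utilization={util:.2f}: KV budget {budget:.1f} GB, "
          f"~{max_concurrent(budget, 268)} requests of 512 tokens")
```

Changing `mb_per_request` to match your model's real per-token KV size is the main adjustment needed before trusting the concurrency estimate.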
Max Model Length
--max-model-len caps the context window (prompt plus output) of any single request. A smaller cap means a smaller worst-case KV cache per request, so more requests fit simultaneously.
If your users never need more than 4K tokens of context:
Setting --max-model-len 4096 on a model that supports 128K
means a single request can claim at most 4K tokens of KV cache, not 128K
→ a 32× smaller worst-case footprint per request
Tensor Parallelism (Multi-GPU)
Tensor parallelism splits the model's weight matrices across multiple GPUs. Each GPU holds a shard of every layer and they communicate via NVLink or PCIe.
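Single GPU (A100 80GB): Can serve LLaMA 70B only when quantized (fp16 weights alone are ~140 GB), with a tight KV cache
2× GPU (A100 80GB): 70B at fp16 fits, with a modest KV cache
4× GPU (A100 80GB): Comfortable 70B, large KV cache, long context windows (128K+)
8× GPU (A100 80GB): 405B models in FP8 (fp16 weights are ~810 GB)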
When to use tensor parallelism
- Model doesn't fit on one GPU → use TP
- Model fits but you need more throughput → try data parallelism (multiple server instances) first — it has less overhead than TP
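The sharding idea can be shown with plain lists standing in for GPUs. This is a toy sketch of column-wise sharding for one linear layer; the helper names are hypothetical, and real tensor parallelism also uses row-wise shards and collective communication between devices.

```python
def matmul(x: list[list[float]], w: list[list[float]]) -> list[list[float]]:
    """Plain matrix product x @ w on nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def column_shards(w: list[list[float]], n: int) -> list[list[list[float]]]:
    """Split w column-wise into n equal shards, one per simulated GPU."""
    cols = len(w[0]) // n
    return [[row[i * cols:(i + 1) * cols] for row in w] for i in range(n)]


x = [[1.0, 2.0]]                      # activations: 1 x 2
w = [[1.0, 2.0, 3.0, 4.0],            # weights: 2 x 4
     [5.0, 6.0, 7.0, 8.0]]

# Each "GPU" multiplies the full input by its own slice of the weights...
partials = [matmul(x, shard) for shard in column_shards(w, n=2)]
# ...and the slices are concatenated (an all-gather) to form the full output.
gathered = [sum((p[0] for p in partials), [])]

print(gathered)                   # [[11.0, 14.0, 17.0, 20.0]]
print(gathered == matmul(x, w))   # True
```

The concatenation step is where the NVLink or PCIe traffic comes from, which is why TP adds overhead that independent server replicas avoid.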
Quantization
Quantization reduces model precision to use less memory and run faster:
Quantization comparison:
| Method | Bits | Quality Loss | VRAM vs FP16 | Speed |
|---|---|---|---|---|
| FP16 | 16 | None (baseline) | 1× | Baseline |
| BF16 | 16 | None | 1× | Same |
| FP8 | 8 | Minimal | 0.5× | 1.5–2× faster |
| GPTQ | 4 | Small | 0.25× | 1.2× faster |
| AWQ | 4 | Minimal (better calibration) | 0.25× | 1.2× faster |
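To see where the quality loss in the table comes from, the sketch below applies the simplest possible scheme: symmetric round-to-nearest with one scale for the whole tensor. It is an illustration only; GPTQ and AWQ are considerably more sophisticated (per-group scales, calibration data), and the function names and sample weights are made up.

```python
def quantize(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Symmetric round-to-nearest quantization with a single scale."""
    qmax = 2 ** (bits - 1) - 1                    # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) for w in weights], scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Map the stored integers back to approximate float weights."""
    return [v * scale for v in q]


weights = [0.12, -0.53, 0.91, -0.07, 0.33]        # hypothetical weight values
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, dequantize(q, scale)))
    print(f"{bits}-bit: ints={q}, max abs error={err:.4f}")
```

Halving the bit width roughly doubles the number of representable steps lost, so the rounding error grows quickly; the calibrated methods in the table exist to keep that error away from the weights that matter most.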
Sampling Parameters Deep Dive
from vllm import SamplingParams

params = SamplingParams(
    # Core sampling
    temperature=0.7,         # 0 = deterministic, >1 = more random
    top_p=0.9,               # nucleus sampling: top tokens summing to 90% probability
    top_k=50,                # only consider top 50 tokens
    # Length control
    max_tokens=512,          # maximum tokens to generate
    min_tokens=10,           # minimum before EOS is allowed
    # Repetition control
    presence_penalty=0.1,    # penalize tokens that already appeared
    frequency_penalty=0.1,   # penalize tokens proportional to how often they appeared
    repetition_penalty=1.1,  # multiplicative penalty (>1 reduces repetition)
    # Stopping conditions
    stop=["<|eot_id|>", "Human:", "User:"],  # stop on these strings
    stop_token_ids=[128009],                 # stop on these token IDs
    # Multiple outputs
    n=3,                     # generate 3 different completions
    best_of=5,               # generate 5 internally, return the 3 highest-scoring
    # Determinism
    seed=42,                 # for reproducible outputs
)
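How `top_k` and `top_p` interact is easier to see on a tiny distribution. The function below follows the common textbook definition (truncate to the k most likely tokens, then keep the smallest set whose probability mass reaches p, then renormalize); it is a sketch with a hypothetical name, and vLLM's sampler may differ in detail.

```python
def filter_top_k_top_p(probs: dict[str, float], top_k: int, top_p: float) -> dict[str, float]:
    """Apply top-k, then nucleus (top-p) truncation, and renormalize what is left."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    kept, mass = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        mass += p
        if mass >= top_p:          # nucleus is complete
            break
    return {token: p / mass for token, p in kept}


probs = {"cat": 0.5, "dog": 0.25, "fish": 0.125, "bird": 0.0625, "emu": 0.0625}
filtered = filter_top_k_top_p(probs, top_k=4, top_p=0.8)
print(filtered)   # only cat, dog, and fish survive; their probabilities sum to 1
```

Temperature would be applied before this step, by rescaling the logits that produce `probs`.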
Part 6: Practical Deployment Patterns
Pattern 1: Simple Docker Deployment
# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}  # for gated models like LLaMA
      - HF_HOME=/cache/huggingface
    volumes:
      - huggingface-cache:/cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # model loading takes time!
volumes:
  huggingface-cache:
# Start
HF_TOKEN=your_token docker compose up -d
# Check logs
docker compose logs -f vllm
# Test
curl http://localhost:8000/v1/models
Pattern 2: Kubernetes Deployment
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai
spec:
  replicas: 1  # scale based on load
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Meta-Llama-3-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--gpu-memory-utilization"
            - "0.90"
            - "--enable-prefix-caching"
          ports:
            - containerPort: 8000
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          resources:
            requests:
              nvidia.com/gpu: "1"
              memory: "32Gi"
              cpu: "4"
            limits:
              nvidia.com/gpu: "1"
              memory: "48Gi"
              cpu: "8"
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          volumeMounts:
            - name: hf-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: hf-cache
          persistentVolumeClaim:
            claimName: hf-model-cache
      nodeSelector:
        accelerator: nvidia-a10g  # schedule on GPU nodes
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai
spec:
  selector:
    app: vllm-server
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
Pattern 3: Multiple Models with a Router
For serving multiple models, use a lightweight router (like LiteLLM) in front of multiple vLLM instances:
# litellm-config.yaml
model_list:
  - model_name: "fast-model"  # alias exposed to clients
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-8B-Instruct
      api_base: http://vllm-8b:8000/v1
      api_key: not-needed
  - model_name: "smart-model"
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-70B-Instruct
      api_base: http://vllm-70b:8000/v1
      api_key: not-needed
  - model_name: "code-model"
    litellm_params:
      model: openai/Qwen/Qwen2.5-Coder-32B-Instruct
      api_base: http://vllm-coder:8000/v1
      api_key: not-needed
router_settings:
  routing_strategy: "least-busy"  # route to least loaded instance
  fallbacks:
    - {"smart-model": ["fast-model"]}  # fallback if 70B is overloaded
Now your clients talk to http://localhost:4000/v1 and can choose any model by name. The router handles load balancing and fallbacks.
Pattern 4: Adding Authentication
By default, vLLM has no authentication. For production:
# auth_proxy.py: simple FastAPI proxy with API key auth
from fastapi import Depends, FastAPI, HTTPException, Request, Response
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
import httpx

app = FastAPI()
security = HTTPBearer()
VALID_KEYS = {"sk-user1-key-here", "sk-user2-key-here"}
VLLM_BASE = "http://localhost:8000"

def verify_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
    if credentials.credentials not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.api_route("/{path:path}", methods=["GET", "POST", "DELETE"])
async def proxy(request: Request, path: str, _=Depends(verify_key)):
    async with httpx.AsyncClient() as client:
        upstream = await client.request(
            method=request.method,
            url=f"{VLLM_BASE}/{path}",
            content=await request.body(),
            headers={k: v for k, v in request.headers.items()
                     if k.lower() not in ("host", "authorization")},
            timeout=300.0,
        )
    # Pass the body through unchanged: /health and /metrics do not return JSON.
    # The whole response is buffered, so stream=True requests are not streamed to the client.
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )
For production, consider dedicated solutions: LiteLLM Proxy (built-in auth, rate limiting, budget management) or Kong / Nginx in front.
Part 7: Advanced Features
Structured Output (JSON Mode)
Force the model to output valid JSON matching a schema, which is critical for production applications:
import json

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Define the schema you want
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "tags": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["name", "price", "in_stock"],
}

params = SamplingParams(
    temperature=0.1,
    max_tokens=200,
    guided_decoding=GuidedDecodingParams(json=product_schema),
)

output = llm.generate(
    ["Extract product info: 'The Nike Air Max 90 costs $150 and is available in stores. Tags: shoes, running, nike'"],
    params,
)

result = json.loads(output[0].outputs[0].text)
print(result)
# e.g. {'name': 'Nike Air Max 90', 'price': 150, 'in_stock': True, 'tags': ['shoes', 'running', 'nike']}
Or via the OpenAI-compatible API:
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool
    tags: list[str]

response = client.beta.chat.completions.parse(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user",
         "content": "Extract: 'The Nike Air Max 90 costs $150 and is available.'"}
    ],
    response_format=Product,
)
product = response.choices[0].message.parsed
print(f"{product.name}: ${product.price}")
Tool Calling / Function Calling
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

# Note: tool_choice="auto" only works if the server was started with
# --enable-auto-tool-choice and a --tool-call-parser that matches the model family.
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
    tool_choice="auto",
)

# Check if the model called a tool
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    print(f"Model called: {function_name}({function_args})")
    # e.g. Model called: get_weather({'city': 'Bangkok', 'unit': 'celsius'})
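Once the model has requested a tool, your application still has to run it and send the result back as a `tool` message. The sketch below shows one way to do that dispatch; `TOOL_REGISTRY`, `run_tool_call`, and the stand-in `get_weather` implementation are hypothetical, and the hard-coded temperature is placeholder data.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"city": city, "temperature": 31, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}   # tool name -> Python callable


def run_tool_call(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Execute the requested tool and wrap the result as a chat message."""
    if name not in TOOL_REGISTRY:
        result = {"error": f"unknown tool: {name}"}
    else:
        result = TOOL_REGISTRY[name](**json.loads(arguments_json))
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}


# With a live response these values come from tool_call.id,
# tool_call.function.name, and tool_call.function.arguments.
reply = run_tool_call("call_0", "get_weather", '{"city": "Bangkok"}')
print(reply)
```

Appending the assistant message and this `reply` to `messages` and calling the chat endpoint again lets the model turn the tool result into a natural-language answer.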
LoRA Adapter Serving
LoRA (Low-Rank Adaptation) lets you fine-tune a model on custom data with minimal compute. vLLM can serve multiple LoRA adapters on top of a single base model:
# Start server with LoRA support
#   --max-loras       max adapters loaded simultaneously
#   --max-lora-rank   max LoRA rank to support
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --enable-lora \
  --max-loras 4 \
  --max-lora-rank 64 \
  --lora-modules \
    customer-service=/path/to/cs-lora \
    legal-qa=/path/to/legal-lora \
    coding=/path/to/coding-lora
Call a specific adapter:
# Use the customer-service LoRA adapter
response = client.chat.completions.create(
    model="customer-service",  # ← adapter name, not base model
    messages=[{"role": "user", "content": "How do I return a product?"}],
)
This is extremely cost-effective: one GPU running one base model, but serving multiple fine-tuned variants with different specializations.
Part 8: Monitoring and Performance Tuning
Built-in Prometheus Metrics
vLLM exposes metrics at /metrics:
Key metrics to watch:
# Request metrics
vllm:request_success_total # successful requests
vllm:prompt_tokens_total # input tokens processed
vllm:generation_tokens_total # output tokens generated
vllm:e2e_request_latency_seconds # end-to-end latency histogram
vllm:time_to_first_token_seconds # latency to first token (TTFT)
vllm:time_per_output_token_seconds # inter-token latency (ITL)
# System metrics
vllm:gpu_cache_usage_perc # KV cache utilization (aim for 70-90%)
vllm:num_requests_running # currently being processed
vllm:num_requests_waiting # queued, waiting for KV cache space
vllm:num_requests_swapped # swapped to CPU (bad — means OOM pressure)
Grafana dashboard — import community dashboards from grafana.com/grafana/dashboards/?search=vllm.
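For a quick check without a full Prometheus stack, the text returned by /metrics can be parsed directly. The sketch below handles the simple `name{labels} value` lines only (no timestamps); `parse_metrics` is a hypothetical helper, and the sample text is made-up data in the exposition format rather than real server output.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {name_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")   # the value is the last field
        metrics[name] = float(value)
    return metrics


# Hypothetical sample; in practice this would be the body of GET /metrics.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="my-llama"} 3.0
vllm:num_requests_waiting{model_name="my-llama"} 12.0
vllm:gpu_cache_usage_perc{model_name="my-llama"} 0.97
"""

m = parse_metrics(sample)
waiting = m['vllm:num_requests_waiting{model_name="my-llama"}']
if waiting > 0:
    print(f"{waiting:.0f} requests queued: the KV cache is likely the bottleneck")
```

A growing waiting queue together with cache usage near 100% is the usual signal to add replicas or reduce `--max-model-len`.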
Prometheus + Grafana Setup
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm:8000']
    metrics_path: '/metrics'
Performance Benchmarking
vLLM ships with a benchmarking tool:
# Benchmark throughput against a running server.
# `vllm bench serve` is available in recent releases; older releases expose the same
# options through benchmarks/benchmark_serving.py in the vLLM repository.
# --request-rate inf sends all requests at once, which measures peak throughput.
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000 \
  --request-rate inf

# Benchmark latency with fixed-length synthetic prompts
vllm bench serve \
  --backend openai-chat \
  --base-url http://localhost:8000 \
  --endpoint /v1/chat/completions \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dataset-name random \
  --random-input-len 512 \
  --random-output-len 128 \
  --num-prompts 100
Example results on A10G (24 GB) with LLaMA 3 8B (illustrative; your numbers will vary with model, prompts, and vLLM version):
Throughput benchmark:
Total time: 52.3 s
Throughput: 19.12 requests/s
Output token throughput: 2,547 tokens/s
Latency benchmark (p50/p90/p99):
Time to first token: 45ms / 89ms / 234ms
Inter-token latency: 12ms / 15ms / 21ms
End-to-end latency: 1.2s / 2.3s / 4.1s (for 128 output tokens)
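If you collect your own latency samples, the same p50/p90/p99 summary can be computed with the standard library. This is a small sketch: `latency_percentiles` is a hypothetical helper, and the sample values are invented for illustration.

```python
import statistics


def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """Return p50/p90/p99 using 100 cut points with inclusive interpolation."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


# Hypothetical time-to-first-token samples in milliseconds
ttft_ms = [42, 45, 44, 47, 51, 43, 46, 95, 48, 230, 44, 45]
print(latency_percentiles(ttft_ms))
```

Tail percentiles matter more than the mean for interactive use: a handful of slow requests (like the 230 ms outlier here) barely move p50 but dominate p99.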
Performance Tuning Checklist
GPU Utilization (aim for >85%):
☐ Use --gpu-memory-utilization 0.90 (or 0.95 if stable)
☐ Enable prefix caching: --enable-prefix-caching
☐ Use chunked prefill: --enable-chunked-prefill
☐ Tune --max-num-seqs (default 256) for your concurrency needs
Memory Efficiency:
☐ Set --max-model-len to actual maximum you need (not model's max)
☐ Use quantization (AWQ/GPTQ) if VRAM is constrained
☐ Use --cpu-offload-gb to offload part of the model weights to CPU RAM (last resort; slows inference)
Throughput:
☐ Send requests in batches when possible (offline scenarios)
☐ Use --max-num-batched-tokens to control memory per iteration
☐ Enable speculative decoding for predictable text workloads
Latency (interactive use):
☐ Use streaming (stream=True) so users see first tokens fast
☐ Preload the model (warmup) with a dummy request at startup
☐ Use smaller model with quantization for lowest latency
☐ Consider the --preemption-mode swap vs recompute tradeoff (where your vLLM version supports it)
Part 9: Recommendations by Use Case
Use Case Matrix
| Scenario | Recommended Setup | Key Flags |
|---|---|---|
| Local development | Single GPU, no server | LLM() class directly in Python |
| Team API (< 10 users) | OpenAI server + 1 GPU | --max-model-len 4096 |
| Production (100s users) | Docker/K8s + monitoring | --enable-prefix-caching --tensor-parallel-size N |
| Cost-sensitive | Quantized model (AWQ) | --quantization awq + smaller model |
| Low latency priority | Speculative decoding | --speculative-model |
| Multiple fine-tuned variants | LoRA serving | --enable-lora --lora-modules ... |
| Structured output (JSON) | Guided decoding | guided_decoding=GuidedDecodingParams(json=...) |
| Very long context (100K+) | Multi-GPU | --tensor-parallel-size 4 --max-model-len 131072 |
Model Selection Guide
For BEST quality (if GPU budget allows):
70B model + vLLM → near-GPT-4 quality for most tasks
For BEST throughput per dollar:
8B AWQ model → roughly a quarter of the weight memory of 8B fp16, which frees VRAM for more concurrent requests at minimal quality loss
For LOWEST latency:
3B–7B model with speculative decoding
For CODE generation:
Qwen2.5-Coder-32B-Instruct or DeepSeek-Coder-V2
For MULTILINGUAL (including Thai):
Qwen2.5-72B-Instruct (excellent multilingual support)
SeaLLM for Southeast Asian languages specifically
For STRUCTURED output heavy workloads:
Models fine-tuned on instruction following + JSON:
Hermes-3-Llama-3.1-8B, Nous-Hermes-2-Mixtral
Cost Estimation
# Quick cost calculator for self-hosted vLLM
def estimate_monthly_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    gpu_cost_per_hour: float,          # e.g., $2.50/hr for A10G on AWS
    throughput_tokens_per_sec: float,  # output tokens/s, from your benchmark
) -> dict:
    input_tokens_per_month = requests_per_day * avg_input_tokens * 30
    output_tokens_per_month = requests_per_day * avg_output_tokens * 30
    tokens_per_month = input_tokens_per_month + output_tokens_per_month

    # GPU hours actually spent generating
    gpu_hours_per_month = output_tokens_per_month / throughput_tokens_per_sec / 3600

    # Assume 24/7 instance (always-on for availability)
    monthly_cost_always_on = 24 * 30 * gpu_cost_per_hour

    # Equivalent OpenAI cost (GPT-4o: $5/1M input, $15/1M output)
    openai_cost = (
        input_tokens_per_month / 1_000_000 * 5      # input
        + output_tokens_per_month / 1_000_000 * 15  # output
    )

    return {
        "tokens_per_month": f"{tokens_per_month:,}",
        "gpu_hours_if_scaled": f"{gpu_hours_per_month:.1f} hours",
        "monthly_cost_always_on": f"${monthly_cost_always_on:,.0f}",
        "openai_equivalent_cost": f"${openai_cost:,.0f}",
        "savings_vs_openai": f"${openai_cost - monthly_cost_always_on:,.0f}",
    }

# Example: 10,000 requests/day, 500 input tokens, 300 output tokens
# A10G at $2.50/hr, throughput 2,000 tokens/sec
result = estimate_monthly_cost(
    requests_per_day=10_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    gpu_cost_per_hour=2.50,
    throughput_tokens_per_sec=2000,
)
for k, v in result.items():
    print(f"{k}: {v}")

tokens_per_month: 240,000,000
gpu_hours_if_scaled: 12.5 hours
monthly_cost_always_on: $1,800
openai_equivalent_cost: $2,100
savings_vs_openai: $300
Common Pitfalls and How to Avoid Them
| Pitfall | Symptom | Fix |
|---|---|---|
| Out of memory at startup | CUDA OOM during model load | Use quantization, or --tensor-parallel-size 2 |
| OOM during inference | CUDA OOM under load | Lower --gpu-memory-utilization to 0.80 |
| Requests timing out | 504 errors under load | Increase max concurrency, add more instances |
| Slow first request | 30+ second latency on cold start | Warm up with a dummy request on startup |
| Wrong model format | Model incompatible errors | Check vLLM supported models list |
| High TTFT (time to first token) | Interactive feel is bad | Reduce max input length, enable chunked prefill |
| Model gives wrong answers | Quality regression vs. original | Test a non-quantized version to confirm |
| No metrics | Can't monitor production | Scrape /metrics with Prometheus |
Summary
# 1. Install
pip install vllm
# 2. Offline inference (Python script)
python -c "
from vllm import LLM, SamplingParams
llm = LLM('Qwen/Qwen2.5-7B-Instruct')
out = llm.generate(['Hello, who are you?'], SamplingParams(max_tokens=100))
print(out[0].outputs[0].text)
"
# 3. OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct \
--port 8000
# 4. Test the server
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hi!"}]}'
# 5. Use with OpenAI SDK
python -c "
from openai import OpenAI
c = OpenAI(base_url='http://localhost:8000/v1', api_key='x')
r = c.chat.completions.create(model='Qwen/Qwen2.5-7B-Instruct', messages=[{'role':'user','content':'Hi!'}])
print(r.choices[0].message.content)
"
# 6. Production: Docker with GPU
docker run --runtime nvidia --gpus all -p 8000:8000 \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct-AWQ \
--quantization awq \
--enable-prefix-caching
vLLM has become the de facto standard for open-source LLM serving because it solves the real-world problems — memory efficiency, throughput, and drop-in compatibility — without requiring you to rewrite your application. Start with the OpenAI-compatible server mode, measure your throughput with the benchmark tool, then tune from there. The jump from naïve model.generate() to vLLM is one of the highest-return investments you can make when moving an LLM application from prototype to production.