vLLM: Production LLM Serving from Zero to Scale

You've downloaded a large language model. You've got it running. But you notice something uncomfortable: it's slow, it can only handle one request at a time, and your GPU is mysteriously underutilized. The moment two people try to use your model at the same time, one of them waits — and waits.

This is the LLM serving problem, and vLLM is the most widely adopted open-source solution to it.

Part 1: The Problem vLLM Solves

Why Naïve LLM Serving Is Slow

When you serve a model by calling model.generate(...) directly, a few things happen behind the scenes that make it unsuitable for serving multiple users:

Problem 1: Memory Waste from Static KV Cache Allocation

Every time a model processes input tokens, it computes Key-Value (KV) tensors for the attention mechanism. These tensors are large and need to be stored in GPU memory throughout the generation.

A request with 512 token context on a 7B model:
  KV cache size = 2 × num_layers × num_heads × head_dim × seq_len × dtype_bytes
                = 2 × 32 × 32 × 128 × 512 × 2 bytes
                = ~268 MB per request

If you pre-allocate for the maximum sequence length (4096):
  268 MB × (4096/512) = 2.1 GB — even if the actual output is 50 tokens!

Naïve frameworks allocate the maximum possible memory upfront. Most of it sits empty and wasted.
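
The arithmetic above can be reproduced with a small helper. This is only a sketch: the function name `kv_cache_bytes` is made up for illustration, and the defaults are the 7B-style shapes from the example (32 layers, 32 heads, head dimension 128, 2-byte fp16 values). For models that use grouped-query attention, the head count to plug in would be the number of KV heads, which is smaller.

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 32, num_heads: int = 32,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * num_layers * num_heads * head_dim * seq_len * dtype_bytes

used = kv_cache_bytes(512)        # what a 512-token request actually needs
reserved = kv_cache_bytes(4096)   # what a max-length pre-allocation reserves

print(f"512 tokens:  {used / 1e6:.0f} MB")
print(f"4096 tokens: {reserved / 1e9:.2f} GB")
print(f"share of the reservation left unused: {1 - used / reserved:.1%}")
```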

Problem 2: No Concurrent Request Handling

Standard inference is sequential: process request 1 completely, then request 2, then request 3.

Naïve server:
  Time →
  GPU:  [===Req1===][===Req2===][===Req3===]
         40s         40s         40s         Total: 120s
         User 1 OK   User 2 waits 40s   User 3 waits 80s

The GPU is also underused while it serves one request at a time: single-sequence token generation is limited by memory bandwidth, so most of the GPU's compute capacity sits idle.

Problem 3: Memory Fragmentation

Even if you try to batch requests, different requests have different lengths. Memory reserved for one long request can't be used by three short requests.

GPU Memory (24 GB):
  [====Request A (2048 max)====][====Request B (2048 max)====][  12 GB free  ]
  ^^ But Request A only uses 200 tokens! About 90% of its reservation is wasted ^^

What vLLM Achieves

vLLM addresses all three problems and delivers:

Metric	Naïve serving	vLLM
Throughput	1–3 req/s (7B)	20–50+ req/s (7B)
GPU Memory Efficiency	40–60% utilized	90–95% utilized
Concurrent users	1	Hundreds
Latency (first token)	Same	Same or faster
Latency (full response)	Baseline	Lower under load (10–24× higher throughput means less queueing)

Part 2: How vLLM Works: The Core Innovations

Innovation 1: PagedAttention

PagedAttention is vLLM's headline invention, introduced in a 2023 UC Berkeley paper. It borrows an idea from operating systems: virtual memory paging.

In a traditional OS, physical RAM is divided into fixed-size pages. A program's memory isn't physically contiguous — it's scattered across pages, and a page table maps virtual addresses to physical pages. This eliminates fragmentation.

PagedAttention applies the same idea to the KV cache:

Traditional KV Cache (pre-vLLM):
  GPU Memory:
  ┌──────────────────────────────────────────────────────┐
  │ Request A: [tok0][tok1][tok2][EMPTY][EMPTY][EMPTY]   │ ← 50% wasted
  │ Request B: [tok0][tok1][EMPTY][EMPTY][EMPTY][EMPTY]  │ ← 67% wasted
  │ Request C: NO ROOM — must wait                       │
  └──────────────────────────────────────────────────────┘

PagedAttention KV Cache:
  GPU Memory divided into fixed-size BLOCKS (e.g., 16 tokens each):
  ┌────┬────┬────┬────┬────┬────┬────┬────┬────┬────┐
  │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ B8 │ B9 │
  └────┴────┴────┴────┴────┴────┴────┴────┴────┴────┘

  Request A uses blocks: B0, B2, B7        (only what it needs)
  Request B uses blocks: B1, B4            (only what it needs)
  Request C uses blocks: B3, B5, B8, B9   (fits in the gaps!)

  Block table maps logical blocks to physical blocks:
  Request A → [B0 at pos 0, B2 at pos 1, B7 at pos 2]

Key benefits:

- Near-zero internal fragmentation: only the last block of each sequence can be partly empty (under 4% waste in practice)
- Multiple requests can share blocks for identical prefixes (prompt caching)
- New requests can start immediately as blocks become available
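
The block-table idea can be sketched in a few lines. This is a toy model, not vLLM's implementation: `BlockAllocator` and its method names are invented for illustration, and it only tracks which physical block ids belong to which request, without storing any tensors.

```python
BLOCK_SIZE = 16  # tokens per block, as in the diagram above

class BlockAllocator:
    """Toy allocator: hands out physical block ids from a free list."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # request id -> physical blocks
        self.lengths: dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, req: str) -> None:
        """Record one more token; allocate a block only when the last one is full."""
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:  # no block yet, or the last block is full
            if not self.free:
                raise MemoryError("no free KV blocks; request must wait")
            self.block_tables.setdefault(req, []).append(self.free.pop(0))
        self.lengths[req] = n + 1

    def release(self, req: str) -> None:
        """Return a finished request's blocks to the free list."""
        self.free.extend(self.block_tables.pop(req, []))
        self.lengths.pop(req, None)

alloc = BlockAllocator(num_blocks=10)
for _ in range(40):
    alloc.append_token("A")  # 40 tokens need 3 blocks
for _ in range(20):
    alloc.append_token("B")  # 20 tokens need 2 blocks
print(alloc.block_tables)
```

Because memory is claimed one block at a time, a request never holds more than one partly empty block, which is where the low fragmentation comes from.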

Innovation 2: Continuous Batching

Naïve batching waits for all requests in a batch to finish before starting new ones. This is wasteful because requests finish at different times:

Static batching:
  Batch 1: [Req A ===30tok===][Req B ======50tok======]
  Batch 2: starts only after BOTH A and B finish (50 tok wait)
  A's slot sits idle for the last 20 steps while B finishes!

Continuous batching (vLLM):
  Iteration 0: [A][B][C]
  Iteration 1: [A][B][C]
  ...
  Iteration 30: [A finishes][B][C]  → immediately swap in [D]!
  Iteration 31: [D][B][C]           ← no gap
  ...
  GPU is always doing useful work

This is sometimes called "iteration-level scheduling" — the scheduler makes decisions at every single token generation step, not at the batch level.
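
To make the difference concrete, here is a small simulation that counts decode iterations under both policies. It is a simplified sketch: the function names and the workload are made up, every request is assumed to cost one iteration per output token, and prefill is ignored.

```python
def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """A finished request's slot is refilled from the queue at the next iteration."""
    queue = list(lengths)
    running: list[int] = []  # remaining tokens per active request
    steps = 0
    while queue or running:
        while queue and len(running) < batch_size:
            running.append(queue.pop(0))
        steps += 1                                   # one decode iteration
        running = [r - 1 for r in running if r > 1]  # drop requests that just finished
    return steps

lengths = [30, 50, 50, 30]  # output lengths in tokens (made-up workload)
print("static:    ", static_batching_steps(lengths, batch_size=2))
print("continuous:", continuous_batching_steps(lengths, batch_size=2))
```

With this workload the static policy needs 100 iterations and the continuous policy 80, because the slot freed by the first 30-token request is reused immediately.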

Innovation 3: Prefix Caching (Prompt Caching)

When multiple requests share the same system prompt or context, vLLM can cache the KV blocks for the shared prefix and reuse them across requests — no recomputation needed.

System prompt: "You are a helpful customer service agent for AcmeCorp..."
              [=============== 500 tokens ===============]

Request 1: [shared prefix 500 tok] + "What are your hours?"
Request 2: [shared prefix 500 tok] + "How do I return a product?"
Request 3: [shared prefix 500 tok] + "Where is my order?"

Without prefix caching: compute 500-token prefix 3 times
With prefix caching:    compute 500-token prefix ONCE, reuse KV cache
Speedup: ~35% for typical RAG/chat applications; the gain grows with the share of each prompt that is common
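
One way to picture block-level prefix reuse is a chained hash per block: two prompts share a cached block only if everything up to and including that block is identical. The sketch below is an illustration under that assumption, not vLLM's actual hashing scheme; `block_hashes`, `prefill_tokens_needed`, and the integer token ids are invented stand-ins.

```python
import hashlib

BLOCK = 16  # tokens per KV block (illustrative)

def block_hashes(tokens: list[int]) -> list[str]:
    """Hash each full block together with all blocks before it."""
    hashes, prev = [], ""
    full = len(tokens) - len(tokens) % BLOCK  # ignore the trailing partial block
    for i in range(0, full, BLOCK):
        prev = hashlib.sha256((prev + str(tokens[i:i + BLOCK])).encode()).hexdigest()
        hashes.append(prev)
    return hashes

cache: set[str] = set()  # hashes of blocks whose KV tensors are already computed

def prefill_tokens_needed(tokens: list[int]) -> int:
    """Tokens that still need prefill after reusing the longest cached prefix."""
    hashes = block_hashes(tokens)
    reused = 0
    for h in hashes:
        if h not in cache:
            break
        reused += BLOCK
    cache.update(hashes)
    return len(tokens) - reused

system = list(range(496))  # shared prefix: 31 full blocks of stand-in token ids
print(prefill_tokens_needed(system + [9001, 9002, 9003]))  # first request: full prefill
print(prefill_tokens_needed(system + [9004, 9005]))        # later request: suffix only
```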

Innovation 4: Speculative Decoding

Speculative decoding uses a small "draft" model to generate multiple candidate tokens quickly, then verifies them with the large model in parallel:

Normal generation (LLaMA 70B):
  Generate token 1 → Generate token 2 → Generate token 3 ...
  Each step: full 70B forward pass

Speculative decoding:
  Draft model (1B) generates 5 candidate tokens: ["The", "cat", "sat", "on", "the"]
  Large model (70B) verifies all 5 in ONE forward pass
  Accept tokens up to the first mismatch, take the large model's own token at that position, discard the rest, continue

  Result: 2–4× more tokens per second for easy/predictable text
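
The accept-or-reject step can be sketched for the greedy case. This is a simplification: production implementations verify against token probabilities with rejection sampling, whereas the hypothetical `verify_draft` below just compares token choices. The `target` list stands in for what the large model would pick at each position in its single batched pass.

```python
def verify_draft(draft: list[str], target: list[str]) -> list[str]:
    """Greedy verification of a draft against the large model's choices.

    target[i] is the token the large model prefers at position i, given the
    draft tokens before it. Draft tokens are kept while they match; at the
    first mismatch the large model's token is emitted and the rest is dropped.
    """
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            accepted.append(t)  # correction from the large model
            break
        accepted.append(d)
    return accepted

draft = ["The", "cat", "sat", "on", "the"]
target = ["The", "cat", "sat", "in", "a"]  # hypothetical large-model choices
print(verify_draft(draft, target))
```

Here three draft tokens are accepted and one corrected token is added, so a single large-model pass yields four tokens instead of one.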

Part 3: Installation and First Steps

Prerequisites

Requirement	Minimum	Recommended
GPU	NVIDIA (CUDA 11.8+)	NVIDIA (CUDA 12.x)
VRAM	8 GB (7B model at fp16... tight)	24 GB (7B comfortable), 80 GB (70B)
RAM	32 GB	64 GB+
Python	3.9	3.11+
OS	Linux	Linux (Ubuntu 22.04)
Driver	CUDA 11.8	CUDA 12.4+

VRAM Requirements

vLLM loads the full model weights into GPU VRAM. A 7B parameter model at float16 needs roughly 14 GB VRAM just for weights, plus KV cache overhead. Plan for:

7B model → 16–24 GB GPU (RTX 4090, A10G, L4)
13B model → 28–40 GB GPU (A100 40GB)
70B model → 2× A100 80GB (fp16), or 4× A10G with 4-bit quantization (fp16 weights alone are ~140 GB)
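
These sizing rules follow from a simple estimate: weight memory is roughly parameter count times bytes per parameter. The helper below is a back-of-the-envelope sketch (the name `weight_gb` is made up); it ignores KV cache, activations, and framework overhead, all of which come on top.

```python
def weight_gb(params_billion: float, bits: int = 16) -> float:
    """Approximate weight memory in GB: parameters x bytes per parameter."""
    return params_billion * bits / 8

for size in (7, 13, 70):
    print(f"{size}B: fp16 ~{weight_gb(size):.0f} GB, 4-bit ~{weight_gb(size, bits=4):.1f} GB")
```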

Installation

Options: pip (standard) · pip (CUDA 11.8) · Docker (recommended for production) · uv (fastest)

# Create a virtual environment first
python -m venv vllm-env
source vllm-env/bin/activate

# Install vLLM (the default wheel targets CUDA 12.x; the exact version depends on the release)
pip install vllm

# Verify installation
python -c "import vllm; print(vllm.__version__)"

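# If you have CUDA 11.8 (older systems)
# Caveat: PyPI serves vLLM's default CUDA 12 wheel, and this extra index only adds
# a cu118 PyTorch. For a true CUDA 11.8 build, install the version-specific +cu118
# wheel from the vLLM GitHub releases page (see the vLLM installation docs).
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118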

# Official vLLM Docker image — no dependency headaches
docker pull vllm/vllm-openai:latest

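# Run with GPU access
# --ipc=host gives the container the shared memory PyTorch uses between processes;
# HF_TOKEN is needed for gated models such as Llama 3
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=$HF_TOKEN" \
    --ipc=host \
    -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model meta-llama/Meta-Llama-3-8B-Instruct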

# uv is much faster than pip for installing large packages
uv venv vllm-env
source vllm-env/bin/activate
uv pip install vllm

Your First Inference

# hello_vllm.py
from vllm import LLM, SamplingParams

# Load model — this downloads from HuggingFace on first run
# (set HF_TOKEN env var for gated models like LLaMA)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Sampling parameters
params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256
)

# Generate — notice we pass a LIST of prompts (batching!)
prompts = [
    "What is PagedAttention in vLLM?",
    "Write a Python function to check if a number is prime.",
    "Explain the concept of entropy in information theory.",
]

outputs = llm.generate(prompts, params)

for output in outputs:
    prompt = output.prompt
    response = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Response: {response}")
    print("─" * 60)

Run it:

python hello_vllm.py

You'll see vLLM process all three prompts in a single batched pass — far more efficient than calling generate three times separately.

Part 4: The OpenAI-Compatible Server

This is vLLM's killer feature for most teams. By exposing an OpenAI-compatible REST API, you can point any existing tool or SDK at your vLLM server and it just works — no code changes needed.

Starting the Server

# Basic startup
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000

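# With common options
# (a comment after a trailing backslash breaks the line continuation,
#  so the flags are explained here instead)
#   --max-model-len            context window
#   --tensor-parallel-size     use 2 GPUs
#   --gpu-memory-utilization   let vLLM use 90% of GPU VRAM
#   --enable-prefix-caching    enable prompt caching
#   --served-model-name        custom model name for API
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --enable-prefix-caching \
    --served-model-name "my-llama"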

Calling from Python

Options: openai library (drop-in replacement) · Streaming responses · curl · httpx (async)

from openai import OpenAI

# Point to your vLLM server instead of api.openai.com
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require a key by default
)

# Chat completions — identical to calling OpenAI API
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of Thailand?"}
    ],
    temperature=0.7,
    max_tokens=200
)

print(response.choices[0].message.content)
print(f"Usage: {response.usage}")

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Stream the response token by token
stream = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True  # ← enable streaming
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline at end

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
            {"role": "user", "content": "What is vLLM?"}
        ],
        "temperature": 0.7,
        "max_tokens": 200
    }'

import asyncio
import httpx

async def chat(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/v1/chat/completions",
            json={
                "model": "meta-llama/Meta-Llama-3-8B-Instruct",
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 300,
            },
            timeout=60.0
        )
        return response.json()["choices"][0]["message"]["content"]

async def main():
    # Run 5 requests concurrently — vLLM handles them all
    results = await asyncio.gather(*[
        chat(f"Question {i}: What is {topic}?")
        for i, topic in enumerate(["Python", "Kubernetes", "vLLM", "LLM", "DevOps"])
    ])
    for r in results:
        print(r[:100], "...\n")

asyncio.run(main())

Checking Server Health and Models

# List available models
curl http://localhost:8000/v1/models | python -m json.tool

# Health check
curl http://localhost:8000/health

# Server metrics (Prometheus format)
curl http://localhost:8000/metrics

Part 5: Key Configuration Options Explained

Every configuration flag affects either quality, speed, or memory. Understanding the trade-offs helps you tune for your workload.

GPU Memory Utilization

--gpu-memory-utilization 0.90  # default: 0.90

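Controls what fraction of each GPU's VRAM the vLLM process may use in total: model weights, activation workspace, and KV cache. Whatever is left of that budget after weights and activations becomes the KV cache.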

GPU VRAM = 24 GB
Model weights (7B fp16) ≈ 14 GB

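With --gpu-memory-utilization 0.90:
  Budget   = 24 × 0.90 = 21.6 GB
  KV cache ≈ 21.6 - 14 = 7.6 GB → ~28 concurrent 512-token requests at ~268 MB each

With --gpu-memory-utilization 0.70:
  Budget   = 24 × 0.70 = 16.8 GB
  KV cache ≈ 16.8 - 14 = 2.8 GB → ~10 concurrent 512-token requests

(Activation workspace is ignored here. Models with grouped-query attention need far less KV memory per token, so they fit several times more requests.)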

If you see OOM errors → lower this value
If you want more concurrent users → raise this value (carefully)
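
The budget arithmetic can be wrapped in a small estimator. This is a rough sketch with invented names (`kv_budget_gb`, `max_concurrent`); it ignores activation workspace and assumes each request needs the ~268 MB computed in Part 1 for 512 tokens on a 7B model with full multi-head attention. The result scales directly with that per-request figure, so measure it for your own model.

```python
def kv_budget_gb(vram_gb: float, weights_gb: float, utilization: float) -> float:
    """Memory left for KV cache: vLLM's budget minus the weights."""
    return max(vram_gb * utilization - weights_gb, 0.0)

def max_concurrent(vram_gb: float, weights_gb: float, utilization: float,
                   kv_mb_per_request: float) -> int:
    """How many requests of the given KV size fit in the budget."""
    return int(kv_budget_gb(vram_gb, weights_gb, utilization) * 1000 // kv_mb_per_request)

for util in (0.90, 0.70):
    n = max_concurrent(24, 14, util, kv_mb_per_request=268)
    print(f"utilization {util:.2f}: {kv_budget_gb(24, 14, util):.1f} GB KV cache, ~{n} requests")
```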

Max Model Length

--max-model-len 4096  # limit context window

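A smaller context window lowers the worst-case KV cache footprint of each request, so more requests fit simultaneously. vLLM also refuses to start if the KV cache cannot hold at least one sequence of --max-model-len tokens.

If your users never need more than 4K tokens context:
  Setting --max-model-len 4096 on a model that supports 128K
  caps each request's KV cache at 4K tokens instead of 128K
  → a 32× smaller worst-case footprint per request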

Tensor Parallelism (Multi-GPU)

--tensor-parallel-size 2  # split model across 2 GPUs
--tensor-parallel-size 4  # split across 4 GPUs

Tensor parallelism splits the model's weight matrices across multiple GPUs. Each GPU holds a shard of every layer and they communicate via NVLink or PCIe.

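Single GPU (A100 80GB): LLaMA 70B only when quantized (fp16 weights alone are ~140 GB); tight KV cache
2× GPU (A100 80GB):     70B in fp16 fits, with limited room for KV cache
4× GPU (A100 80GB):     Comfortable 70B, large KV cache, large context windows (128K+)
8× GPU (A100 80GB):     405B models with FP8 or 4-bit quantization (fp16 weights are ~810 GB)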

When to use tensor parallelism

Model doesn't fit on one GPU → use TP
Model fits but you need more throughput → try data parallelism (multiple server instances) first — it has less overhead than TP

For the full decision tree — TP vs PP vs DP, the per-layer AllReduce cost, multi-node Ray setup, and Expert Parallelism for MoE models — see Scaling LLM Inference: DP, PP, and TP.
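
As a first-pass sizing aid, the "does it fit" question can be sketched as a loop over power-of-two GPU counts. The helper name `min_tensor_parallel_size` and the 2 GB minimum KV headroom are assumptions for illustration; the estimate ignores activation memory and communication buffers, and real deployments also need the model's attention head count to be divisible by the TP size.

```python
def min_tensor_parallel_size(weights_gb: float, gpu_gb: float,
                             utilization: float = 0.90,
                             min_kv_gb: float = 2.0) -> int:
    """Smallest power-of-two GPU count whose combined budget holds weights plus KV headroom."""
    tp = 1
    while tp * gpu_gb * utilization < weights_gb + min_kv_gb:
        tp *= 2
    return tp

print("70B fp16 on A100 80GB:", min_tensor_parallel_size(140, 80))
print("8B fp16 on A10G 24GB: ", min_tensor_parallel_size(16, 24))
print("405B FP8 on A100 80GB:", min_tensor_parallel_size(405, 80))
```

A result of 1 means tensor parallelism is optional, which is the case where extra GPUs are usually better spent on additional replicas.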

Quantization

Quantization reduces model precision to use less memory and run faster:

Options: AWQ (recommended) · GPTQ · FP8 (Hopper/Ada GPUs) · BitsAndBytes (most flexible)

# Use a pre-quantized model from HuggingFace
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq

AWQ (Activation-aware Weight Quantization) quantizes to 4-bit with minimal accuracy loss. A 7B AWQ model uses ~4 GB VRAM instead of ~14 GB.

python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-chat-GPTQ \
    --quantization gptq

# FP8 runs fastest on GPUs with native FP8 support (Hopper: H100, H200; also Ada Lovelace)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --quantization fp8

# Load a model in 4-bit on the fly (no pre-quantized checkpoint needed)
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes"
)

Quantization comparison:

Method	Bits	Quality Loss	VRAM vs FP16	Speed
FP16	16	None (baseline)	1×	Baseline
BF16	16	None	1×	Same
FP8	8	Minimal	0.5×	1.5–2× faster
GPTQ	4	Small	0.25×	1.2× faster
AWQ	4	Minimal (better calibration)	0.25×	1.2× faster

Sampling Parameters Deep Dive

from vllm import SamplingParams

params = SamplingParams(
    # Core sampling
    temperature=0.7,      # 0 = deterministic, >1 = more random
    top_p=0.9,           # nucleus sampling: top tokens summing to 90% probability
    top_k=50,            # only consider top 50 tokens

    # Length control
    max_tokens=512,       # maximum tokens to generate
    min_tokens=10,        # minimum before EOS is allowed

    # Repetition control
    presence_penalty=0.1, # penalize tokens that already appeared
    frequency_penalty=0.1,# penalize tokens proportional to how often they appeared
    repetition_penalty=1.1,# multiplicative penalty (>1 reduces repetition)

    # Stopping conditions
    stop=["<|eot_id|>", "Human:", "User:"],  # stop on these strings
    stop_token_ids=[128009],                  # stop on these token IDs

    # Multiple outputs
    n=3,                  # generate 3 different completions
    best_of=5,           # generate 5 internally, return the 3 highest-scoring
                         # (newer vLLM releases dropped best_of; remove it if rejected)

    # Determinism
    seed=42,             # for reproducible outputs
)
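
To see how temperature, top_k, and top_p interact, here is a pure-Python sketch of the filtering they describe. It is an illustration only: `filter_distribution` is a made-up name, the logits are invented, temperature must be greater than zero here, and vLLM's actual sampler is a batched GPU implementation whose details may differ.

```python
import math

def filter_distribution(logits: dict[str, float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> dict[str, float]:
    """Return the renormalized distribution a sampler would draw from."""
    # Temperature scales logits before softmax; lower values sharpen the distribution
    scaled = {tok: logit / temperature for tok, logit in logits.items()}
    peak = max(scaled.values())
    exp = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((p / total, tok) for tok, p in exp.items()), reverse=True)

    if top_k > 0:
        ranked = ranked[:top_k]  # keep the k most likely tokens

    kept, cumulative = [], 0.0
    for p, tok in ranked:        # keep the smallest set that reaches top_p mass
        kept.append((p, tok))
        cumulative += p
        if cumulative >= top_p:
            break

    norm = sum(p for p, _ in kept)
    return {tok: p / norm for p, tok in kept}

logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0, "rock": -3.0}
print(filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9))
```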

Part 6: Practical Deployment Patterns

Pattern 1: Simple Docker Deployment

# docker-compose.yml
version: "3.8"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN}  # for gated models like LLaMA
      - HF_HOME=/cache/huggingface
    volumes:
      - huggingface-cache:/cache/huggingface
    ports:
      - "8000:8000"
    command: >
      --model meta-llama/Meta-Llama-3-8B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 8192
      --gpu-memory-utilization 0.90
      --enable-prefix-caching
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 120s  # model loading takes time!

volumes:
  huggingface-cache:

# Start
HF_TOKEN=your_token docker compose up -d

# Check logs
docker compose logs -f vllm

# Test
curl http://localhost:8000/v1/models

Pattern 2: Kubernetes Deployment

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
  namespace: ai
spec:
  replicas: 1  # scale based on load
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Meta-Llama-3-8B-Instruct"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--enable-prefix-caching"
        ports:
        - containerPort: 8000
        env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-credentials
              key: token
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "32Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
            memory: "48Gi"
            cpu: "8"
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 180
          periodSeconds: 30
        volumeMounts:
        - name: hf-cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: hf-cache
        persistentVolumeClaim:
          claimName: hf-model-cache
      nodeSelector:
        accelerator: nvidia-a10g  # schedule on GPU nodes
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: ai
spec:
  selector:
    app: vllm-server
  ports:
  - port: 80
    targetPort: 8000
  type: ClusterIP

Pattern 3: Multiple Models with a Router

For serving multiple models, use a lightweight router (like LiteLLM) in front of multiple vLLM instances:

# litellm-config.yaml
model_list:
  - model_name: "fast-model"      # alias exposed to clients
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-8B-Instruct
      api_base: http://vllm-8b:8000/v1
      api_key: not-needed

  - model_name: "smart-model"
    litellm_params:
      model: openai/meta-llama/Meta-Llama-3-70B-Instruct
      api_base: http://vllm-70b:8000/v1
      api_key: not-needed

  - model_name: "code-model"
    litellm_params:
      model: openai/Qwen/Qwen2.5-Coder-32B-Instruct
      api_base: http://vllm-coder:8000/v1
      api_key: not-needed

router_settings:
  routing_strategy: "least-busy"  # route to least loaded instance
  fallbacks:
    - {"smart-model": ["fast-model"]}  # fallback if 70B is overloaded

# Run the router
litellm --config litellm-config.yaml --port 4000

Now your clients talk to http://localhost:4000/v1 and can choose any model by name. The router handles load balancing and fallbacks.
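
The routing behaviour configured above can be pictured with a small sketch. This is not LiteLLM's implementation: the `Backend` class, the `pick_backend` function, and the in-flight threshold are hypothetical, and the URLs simply reuse the ones from the config.

```python
from dataclasses import dataclass

@dataclass
class Backend:
    url: str
    in_flight: int = 0      # requests currently being served
    max_in_flight: int = 8  # illustrative overload threshold

def pick_backend(model: str, pools: dict[str, list[Backend]],
                 fallbacks: dict[str, list[str]]) -> Backend:
    """Pick the least-busy backend for a model, falling back if all are saturated."""
    for name in [model, *fallbacks.get(model, [])]:
        candidates = [b for b in pools.get(name, []) if b.in_flight < b.max_in_flight]
        if candidates:
            return min(candidates, key=lambda b: b.in_flight)
    raise RuntimeError(f"no capacity for {model!r}")

pools = {
    "fast-model": [Backend("http://vllm-8b:8000/v1", in_flight=3)],
    "smart-model": [Backend("http://vllm-70b:8000/v1", in_flight=8)],  # saturated
}
fallbacks = {"smart-model": ["fast-model"]}
print(pick_backend("smart-model", pools, fallbacks).url)
```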

Pattern 4: Adding Authentication

By default, vLLM has no authentication. For production:

# auth_proxy.py: simple FastAPI proxy with API key auth
# Note: the upstream response is buffered in full, so stream=True requests are not streamed
from fastapi import FastAPI, HTTPException, Depends, Request, Response
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
import httpx

app = FastAPI()
security = HTTPBearer()

VALID_KEYS = {"sk-user1-key-here", "sk-user2-key-here"}
VLLM_BASE = "http://localhost:8000"

def verify_key(credentials: HTTPAuthorizationCredentials = Depends(security)):
    if credentials.credentials not in VALID_KEYS:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.api_route("/{path:path}", methods=["GET", "POST", "DELETE"])
async def proxy(request: Request, path: str, _=Depends(verify_key)):
    async with httpx.AsyncClient() as client:
        upstream = await client.request(
            method=request.method,
            url=f"{VLLM_BASE}/{path}",
            content=await request.body(),
            headers={k: v for k, v in request.headers.items()
                     if k.lower() not in ("host", "authorization")},
            timeout=300.0,
        )
    # Pass status and body through unchanged; /health and /metrics are not JSON,
    # and upstream error codes should reach the client
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type=upstream.headers.get("content-type"),
    )

For production, consider dedicated solutions: LiteLLM Proxy (built-in auth, rate limiting, budget management) or Kong / Nginx in front.

Part 7: Advanced Features

Structured Output (JSON Mode)

Force the model to output valid JSON matching a schema — critical for production applications:

from vllm import LLM, SamplingParams
# Newer vLLM releases rename this API (StructuredOutputsParams); check your version
from vllm.sampling_params import GuidedDecodingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")

# Define the schema you want
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
        "tags": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "price", "in_stock"]
}

params = SamplingParams(
    temperature=0.1,
    max_tokens=200,
    guided_decoding=GuidedDecodingParams(json=product_schema)
)

output = llm.generate(
    ["Extract product info: 'The Nike Air Max 90 costs $150 and is available in stores. Tags: shoes, running, nike'"],
    params
)

import json
result = json.loads(output[0].outputs[0].text)
print(result)
# {"name": "Nike Air Max 90", "price": 150.0, "in_stock": true, "tags": ["shoes", "running", "nike"]}

Or via the OpenAI-compatible API:

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool
    tags: list[str]

response = client.beta.chat.completions.parse(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[
        {"role": "user",
         "content": "Extract: 'The Nike Air Max 90 costs $150 and is available.'"}
    ],
    response_format=Product,
)

product = response.choices[0].message.parsed
print(f"{product.name}: ${product.price}")

Tool Calling / Function Calling

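# Server side: tool_choice="auto" only works if vLLM was started with
# --enable-auto-tool-choice and a --tool-call-parser that matches the model family
from openai import OpenAI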
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["city"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Bangkok?"}],
    tools=tools,
    tool_choice="auto"
)

# Check if the model called a tool
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    print(f"Model called: {function_name}({function_args})")
    # e.g. Model called: get_weather({'city': 'Bangkok', 'unit': 'celsius'})
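
A tool call is only half of the loop: the application still has to run the function and send the result back as a `tool` message. The sketch below shows that step in isolation, without a server. The `get_weather` body, the `TOOLS` registry, and `run_tool_call` are hypothetical stand-ins; the message shape follows the OpenAI chat format that the API above mirrors.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"city": city, "temperature": 31, "unit": unit}

TOOLS = {"get_weather": get_weather}  # tool name -> local Python function

def run_tool_call(call_id: str, name: str, arguments: str) -> dict:
    """Execute one tool call and build the tool message to send back to the model."""
    if name not in TOOLS:
        result = {"error": f"unknown tool {name}"}
    else:
        result = TOOLS[name](**json.loads(arguments))
    return {"role": "tool", "tool_call_id": call_id, "content": json.dumps(result)}

# In the example above these values would come from tool_call.id,
# tool_call.function.name and tool_call.function.arguments
msg = run_tool_call("call_0", "get_weather", '{"city": "Bangkok"}')
print(msg)
```

Appending the assistant message and this tool message to `messages` and calling the chat endpoint again lets the model produce its final answer.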

LoRA Adapter Serving

LoRA (Low-Rank Adaptation) lets you fine-tune a model on custom data with minimal compute. vLLM can serve multiple LoRA adapters on top of a single base model:

# Start server with LoRA support
# (a comment after a trailing backslash breaks the line continuation,
#  so the flags are explained here instead)
#   --max-loras      max adapters loaded simultaneously
#   --max-lora-rank  max LoRA rank to support
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --enable-lora \
    --max-loras 4 \
    --max-lora-rank 64 \
    --lora-modules \
        customer-service=/path/to/cs-lora \
        legal-qa=/path/to/legal-lora \
        coding=/path/to/coding-lora

Call a specific adapter:

# Use the customer-service LoRA adapter
response = client.chat.completions.create(
    model="customer-service",  # ← adapter name, not base model
    messages=[{"role": "user", "content": "How do I return a product?"}]
)

This is extremely cost-effective: one GPU running one base model, but serving multiple fine-tuned variants with different specializations.
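
The reason adapters are cheap to keep resident is their size. A LoRA update replaces a full weight delta with two thin matrices, so its parameter count grows with the rank rather than with the square of the hidden size. The numbers below are a sketch for a single 4096 × 4096 projection (an assumed, typical shape for an 8B-class model); `lora_params` is a made-up helper.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA update: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

full = 4096 * 4096
adapter = lora_params(4096, 4096, rank=64)
print(f"full matrix: {full:,} params")
print(f"rank-64 adapter: {adapter:,} params ({adapter / full:.1%} of the full matrix)")
```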

Part 8: Monitoring and Performance Tuning

Built-in Prometheus Metrics

vLLM exposes metrics at /metrics:

curl http://localhost:8000/metrics

Key metrics to watch:

# Request metrics
vllm:request_success_total           # successful requests
vllm:request_prompt_tokens_total     # input tokens processed
vllm:request_generation_tokens_total # output tokens generated
vllm:e2e_request_latency_seconds     # end-to-end latency histogram
vllm:time_to_first_token_seconds     # latency to first token (TTFT)
vllm:time_per_output_token_seconds   # inter-token latency (ITL)

# System metrics
vllm:gpu_cache_usage_perc            # KV cache utilization (aim for 70-90%)
vllm:num_requests_running            # currently being processed
vllm:num_requests_waiting            # queued, waiting for KV cache space
vllm:num_requests_swapped            # swapped to CPU (bad — means OOM pressure)

Grafana dashboard — import community dashboards from grafana.com/grafana/dashboards/?search=vllm.
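
For quick checks without a full Prometheus stack, the text format returned by /metrics is easy to parse. The sketch below is deliberately minimal: `parse_metrics` is a made-up helper, the sample text is invented for illustration, and lines that carry a trailing timestamp are not handled. For anything beyond a smoke test, a real Prometheus client library is the better choice.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {metric_name: value}, summing over label sets."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        values[name] = values.get(name, 0.0) + float(value)
    return values

# Illustrative sample; in practice this would be the body of GET /metrics
sample = """\
# HELP vllm:num_requests_running Number of requests currently running
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="my-llama"} 12.0
vllm:num_requests_waiting{model_name="my-llama"} 3.0
"""

m = parse_metrics(sample)
if m["vllm:num_requests_waiting"] > 0:
    print("requests are queuing; consider more KV cache or another replica")
```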

Prometheus + Grafana Setup

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['vllm:8000']
    metrics_path: '/metrics'

Performance Benchmarking

vLLM ships with a benchmarking tool:

# Benchmark online serving throughput against a running server
# (recent releases expose this as "vllm bench"; older ones ship the same tool as
#  benchmarks/benchmark_serving.py in the vLLM repo. Check "vllm bench --help".)
vllm bench serve \
    --backend openai-chat \
    --base-url http://localhost:8000 \
    --endpoint /v1/chat/completions \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 10   # requests per second

# Benchmark offline latency (loads the model itself; no server needed)
vllm bench latency \
    --model meta-llama/Meta-Llama-3-8B-Instruct \
    --input-len 512 \
    --output-len 128

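Example output on A10G (24 GB) with LLaMA 3 8B. Treat the numbers as illustrative: a run capped at --request-rate 10 cannot report more than 10 requests/s, so the throughput figure below implies an uncapped run: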

Throughput benchmark:
  Total time: 52.3 s
  Throughput: 19.12 requests/s
  Output token throughput: 2,547 tokens/s

Latency benchmark (p50/p90/p99):
  Time to first token:  45ms / 89ms / 234ms
  Inter-token latency:  12ms / 15ms / 21ms
  End-to-end latency:  1.2s / 2.3s / 4.1s (for 128 output tokens)
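
If you collect per-request latencies yourself (for example from client-side timing), the same p50/p90/p99 summary can be computed with the standard library. This is a small sketch; `latency_percentiles` is a made-up helper and the sample data is synthetic.

```python
import statistics

def latency_percentiles(samples: list[float]) -> dict[str, float]:
    """p50/p90/p99 of a list of latencies (needs at least two samples)."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}

# Synthetic latencies in milliseconds: 1, 2, ..., 101
samples = [float(ms) for ms in range(1, 102)]
print(latency_percentiles(samples))
```

Tail percentiles matter more than the mean for interactive use, since a slow p99 is what users notice under load.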

Performance Tuning Checklist

GPU Utilization (aim for >85%):
  ☐ Use --gpu-memory-utilization 0.90 (or 0.95 if stable)
  ☐ Enable prefix caching: --enable-prefix-caching
  ☐ Use chunked prefill: --enable-chunked-prefill
  ☐ Tune --max-num-seqs (default 256) for your concurrency needs

Memory Efficiency:
  ☐ Set --max-model-len to actual maximum you need (not model's max)
  ☐ Use quantization (AWQ/GPTQ) if VRAM is constrained
  ☐ Use --cpu-offload-gb to offload part of the model weights to CPU RAM (last resort; much slower)

Throughput:
  ☐ Send requests in batches when possible (offline scenarios)
  ☐ Use --max-num-batched-tokens to control memory per iteration
  ☐ Enable speculative decoding for predictable text workloads

Latency (interactive use):
  ☐ Use streaming (stream=True) so users see first tokens fast
  ☐ Preload the model (warmup) with a dummy request at startup
  ☐ Use smaller model with quantization for lowest latency
  ☐ Consider the --preemption-mode recompute vs swap tradeoff (where your vLLM version exposes it)

Part 9: Recommendations by Use Case

Use Case Matrix

Scenario	Recommended Setup	Key Flags
Local development	Single GPU, no server	`LLM()` class directly in Python
Team API (< 10 users)	OpenAI server + 1 GPU	`--max-model-len 4096`
Production (100s users)	Docker/K8s + monitoring	`--enable-prefix-caching --tensor-parallel-size N`
Cost-sensitive	Quantized model (AWQ)	`--quantization awq` + smaller model
Low latency priority	Speculative decoding	`--speculative-model`
Multiple fine-tuned variants	LoRA serving	`--enable-lora --lora-modules ...`
Structured output (JSON)	Guided decoding	`guided_decoding=GuidedDecodingParams(json=...)`
Very long context (100K+)	Multi-GPU	`--tensor-parallel-size 4 --max-model-len 131072`

Model Selection Guide

For BEST quality (if GPU budget allows):
  70B model + vLLM → near-GPT-4 quality for most tasks

For BEST throughput per dollar:
  8B AWQ model → about a quarter of the weight memory of 8B fp16, which frees VRAM for KV cache and concurrency, at a small quality loss

For LOWEST latency:
  3B–7B model with speculative decoding

For CODE generation:
  Qwen2.5-Coder-32B-Instruct or DeepSeek-Coder-V2

For MULTILINGUAL (including Thai):
  Qwen2.5-72B-Instruct (excellent multilingual support)
  SeaLLM for Southeast Asian languages specifically

For STRUCTURED output heavy workloads:
  Models fine-tuned on instruction following + JSON:
  Hermes-3-Llama-3.1-8B, Nous-Hermes-2-Mixtral

Cost Estimation

# Quick cost calculator for self-hosted vLLM
def estimate_monthly_cost(
    requests_per_day: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    gpu_cost_per_hour: float,  # e.g., $2.50/hr for A10G on AWS
    throughput_tokens_per_sec: float  # output tokens/s from your benchmark
) -> dict:

    input_tokens_per_month = requests_per_day * avg_input_tokens * 30
    output_tokens_per_month = requests_per_day * avg_output_tokens * 30
    tokens_per_month = input_tokens_per_month + output_tokens_per_month

    # GPU hours of actual generation work per month
    gpu_hours_per_month = output_tokens_per_month / throughput_tokens_per_sec / 3600

    # Assume 24/7 instance (always-on for availability)
    monthly_cost_always_on = 24 * 30 * gpu_cost_per_hour

    # Equivalent OpenAI cost (GPT-4o: $5/1M input, $15/1M output)
    openai_cost = (
        input_tokens_per_month / 1_000_000 * 5       # input
        + output_tokens_per_month / 1_000_000 * 15   # output
    )

    return {
        "tokens_per_month": f"{tokens_per_month:,}",
        "gpu_hours_if_scaled": f"{gpu_hours_per_month:.1f} hours",
        "monthly_cost_always_on": f"${monthly_cost_always_on:,.0f}",
        "openai_equivalent_cost": f"${openai_cost:,.0f}",
        "savings_vs_openai": f"${openai_cost - monthly_cost_always_on:,.0f}"
    }

# Example: 10,000 requests/day, 500 input tokens, 300 output tokens
# A10G at $2.50/hr, throughput 2,000 tokens/sec
result = estimate_monthly_cost(
    requests_per_day=10_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    gpu_cost_per_hour=2.50,
    throughput_tokens_per_sec=2000
)
for k, v in result.items():
    print(f"{k}: {v}")

tokens_per_month: 240,000,000
gpu_hours_if_scaled: 12.5 hours
monthly_cost_always_on: $1,800
openai_equivalent_cost: $2,100
savings_vs_openai: $300

At this volume the GPU is busy for only 12.5 of the 720 hours it is billed for, so the always-on instance saves little. The gap widens as request volume grows, because the GPU cost stays flat while the per-token API cost does not.

Common Pitfalls and How to Avoid Them

Pitfall	Symptom	Fix
Out of memory at startup	`CUDA OOM` during model load	Use quantization, or `--tensor-parallel-size 2`
OOM during inference	`CUDA OOM` under load	Lower `--gpu-memory-utilization` to 0.80
Requests timing out	504 errors under load	Increase max concurrency, add more instances
Slow first request	30+ second latency on cold start	Warm up with a dummy request on startup
Wrong model format	Model incompatible errors	Check vLLM supported models list
High TTFT (time to first token)	Interactive feel is bad	Reduce max input length, enable chunked prefill
Model gives wrong answers	Quality regression vs. original	Test a non-quantized version to confirm
No metrics	Can't monitor production	Scrape `/metrics` with Prometheus

Summary

# 1. Install
pip install vllm

# 2. Offline inference (Python script)
python -c "
from vllm import LLM, SamplingParams
llm = LLM('Qwen/Qwen2.5-7B-Instruct')
out = llm.generate(['Hello, who are you?'], SamplingParams(max_tokens=100))
print(out[0].outputs[0].text)
"

# 3. OpenAI-compatible server
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --port 8000

# 4. Test the server
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hi!"}]}'

# 5. Use with OpenAI SDK
python -c "
from openai import OpenAI
c = OpenAI(base_url='http://localhost:8000/v1', api_key='x')
r = c.chat.completions.create(model='Qwen/Qwen2.5-7B-Instruct', messages=[{'role':'user','content':'Hi!'}])
print(r.choices[0].message.content)
"

# 6. Production: Docker with GPU
docker run --runtime nvidia --gpus all -p 8000:8000 \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-7B-Instruct-AWQ \
    --quantization awq \
    --enable-prefix-caching

vLLM has become the de facto standard for open-source LLM serving because it solves the real-world problems — memory efficiency, throughput, and drop-in compatibility — without requiring you to rewrite your application. Start with the OpenAI-compatible server mode, measure your throughput with the benchmark tool, then tune from there. The jump from naïve model.generate() to vLLM is one of the highest-return investments you can make when moving an LLM application from prototype to production.
