AI

June 30, 2026
in AI, Architect
10 min read

Scaling LLM Inference: DP, PP, and TP with vLLM

You sized the model and it doesn't fit on one GPU — or it fits but can't keep up with traffic. Both problems are solved by spreading the model across cards, but with different knobs. Pick the wrong one and you'll waste GPUs, melt your latency, or both. This post is the decision tree.

June 26, 2026
in Kubernetes, AI, DevOps
9 min read

GPUs on Kubernetes: From Bare Metal to Schedulable in One Operator

A fresh Kubernetes cluster has no idea your nodes have GPUs. kubectl describe node shows CPU, memory, and pods — nothing else. To make a pod request a GPU you need a driver, a container runtime hook, and a device plugin advertising the hardware to the scheduler, all version-matched across every GPU node. Do it by hand and you'll re-do it on every kernel bump. This post wires it up the way you actually want — one operator — and works on any Kubernetes, not a specific vendor's distro.

June 26, 2026
in Kubernetes, AI, DevOps
5 min read

Stacking MIG and Time-Slicing on One GPU Operator values.yaml

MIG carves a GPU into hardware-isolated slices. Time-slicing oversubscribes each slice so more pods can share it. Stack them and one physical GPU advertises far more schedulable units than it has silicon — useful when you have more workloads than GPUs and most of them sit idle. Here's the exact values.yaml, wired into the kommander-applications GPU Operator 26.3.0 app¹, and the labels that switch a node between layouts.

June 26, 2026
in Kubernetes, AI, Architect
14 min read

Build Your Own Token-as-a-Service: A Self-Hosted OpenAI-Compatible AI Gateway on Kubernetes

Public LLM APIs bill you per token and leak your prompts off-prem. The fix isn't "run Ollama on a box" — it's a centralized, OpenAI-compatible inference platform your whole org consumes like a utility: one API key, token-based quotas, models that stay in your datacenter. Call it token-as-a-service. This post is the architecture — model tiering, GPU math, MIG partitioning, vLLM tuning, and KV-cache-aware routing — generic enough to run on any Kubernetes cluster with NVIDIA GPUs.

June 26, 2026
in Kubernetes, AI, Architect
10 min read

GPU DRA on Kubernetes: migrating off the NVIDIA device plugin, end-to-end

Dynamic Resource Allocation went GA in Kubernetes 1.35¹, and NVIDIA's DRA driver for GPUs shipped v0.4.1 on 2026-06-30². The nvidia.com/gpu extended resource model — the one the device plugin has used since 2017 — now has a sanctioned successor. The migration is non-trivial because the resource model is completely different (claim-based, not extended-resource-based), but the payoff is real: structured GPU allocation, native NVLink topology awareness, and a single API for MIG, MPS, time-slicing, and Multi-Node NVLink.

This post is the migration path: install the driver, write the first claim, swap one workload at a time, and run device plugin + DRA side-by-side until you're done.

June 17, 2026
in Kubernetes, AI
11 min read

You have 8 H100s. You have 30 inference pods that need one GPU each. The naive answer, buying 22 more H100s, is a quarter-million-dollar mistake. The right answer is GPU sharing, and there are now four ways to do it: time-slicing, MIG, MPS, and DRA. Pick wrong and you get OOM kills, throughput collapse, or training jobs that silently corrupt each other. Pick right and you run 30 pods on 8 GPUs with measurable isolation guarantees.

This post walks each option with the actual config, the actual tradeoffs, and the decision tree that tells you which one to use for which LLM workload.
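
As a rough illustration of the 30-pods-on-8-GPUs arithmetic, here is a minimal Python sketch. It only counts schedulable slots and assumes every GPU is shared evenly; it does not check whether the pods actually fit in memory or how much isolation a given sharing mode provides. The function names are hypothetical and not taken from the post.

```python
import math


def replicas_per_gpu(pods: int, gpus: int) -> int:
    """Smallest even sharing factor so that `gpus` GPUs expose at least `pods` slots."""
    if pods < 0 or gpus <= 0:
        raise ValueError("pods must be >= 0 and gpus must be > 0")
    return math.ceil(pods / gpus)


def schedulable_slots(gpus: int, replicas: int) -> int:
    """Total slots advertised when every GPU is shared `replicas` ways."""
    return gpus * replicas


# The scenario from the excerpt: 30 one-GPU pods, 8 physical GPUs.
replicas = replicas_per_gpu(pods=30, gpus=8)
slots = schedulable_slots(gpus=8, replicas=replicas)
print(replicas, slots)  # 4 32
```

Sharing each GPU four ways yields 32 slots, two more than the 30 pods need. Whether those slots are safe to use depends on per-pod memory and the sharing mechanism, which this count ignores.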

June 17, 2026
in AI, Architect
7 min read

How Much VRAM Does Your LLM Actually Need? A Field Guide to Sizing GPUs

"Will this model fit on my GPU?" has one honest answer: do the arithmetic. It's three numbers added together, and you can pull every input from a model's config.json in about two minutes. This post turns a sizing spreadsheet into a method you can run by hand.

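The excerpt above does not say which three numbers it adds. One common breakdown, assumed here, is model weights plus KV cache plus a fixed runtime overhead. The Python sketch below follows that assumption. The config keys are the usual Hugging Face `config.json` field names, the parameter count is passed in separately, and the 2 GiB overhead and the example config values are placeholders rather than figures from the post.

```python
GIB = 1024 ** 3


def estimate_vram_gib(config, n_params, context_len, batch_size,
                      weight_bytes=2, kv_bytes=2, overhead_gib=2.0):
    """Rough VRAM estimate: weights + KV cache + fixed overhead (all in GiB).

    Assumes 2-byte (FP16/BF16) weights and KV entries by default.
    `overhead_gib` is a placeholder for activations, CUDA context, etc.
    """
    layers = config["num_hidden_layers"]
    heads = config["num_attention_heads"]
    # Grouped-query attention models store fewer KV heads than query heads.
    kv_heads = config.get("num_key_value_heads", heads)
    head_dim = config["hidden_size"] // heads

    weights = n_params * weight_bytes
    # Factor 2 covers keys and values, stored per layer, per token, per sequence.
    kv_cache = 2 * layers * kv_heads * head_dim * kv_bytes * context_len * batch_size

    weights_gib = weights / GIB
    kv_cache_gib = kv_cache / GIB
    return {
        "weights_gib": weights_gib,
        "kv_cache_gib": kv_cache_gib,
        "overhead_gib": overhead_gib,
        "total_gib": weights_gib + kv_cache_gib + overhead_gib,
    }


# Hypothetical config values for illustration only.
example_config = {
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
}
estimate = estimate_vram_gib(example_config, n_params=8e9,
                             context_len=8192, batch_size=1)
print({k: round(v, 2) for k, v in estimate.items()})
```

With these illustrative inputs the KV cache comes to exactly 1 GiB for a single 8192-token sequence, so the weights dominate; raising `batch_size` or `context_len` scales the KV term linearly.
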
June 17, 2026
in AI, Kubernetes, Architect
9 min read

An On-Premise RAG Reference Architecture for 100 Users: Right-Sized GPUs, HA by Design

You can build a production RAG platform on your own hardware for the price of two years of API bills, and it fits in half a rack. The trick is sizing the GPUs to the models instead of buying the biggest card on the truck. This is the full design for a 100-user, highly available, hybrid-plus-graph RAG stack, with every GPU choice backed by the VRAM math.

June 16, 2026
in AI, Kubernetes, Architect
32 min read

GPU for LLM Workloads Reference

LLM workloads are not "regular workloads that happen to need more RAM." They are memory-bandwidth-bound during decode, compute-bound during prefill, and topology-bound the moment a model can't fit on one GPU. The hardware spec sheet, the VM config, the container runtime, the Kubernetes device plugin, and the multi-GPU pattern you pick are all the same decision at different layers. Get one wrong and the others stop mattering.

This is a reference for the whole stack — from the HBM3e bandwidth of a Blackwell GPU to the nvidia.com/gpu resource advertised by a DaemonSet to the tensor-parallel group size that decides whether your 70B model serves or stalls. Every number is verified against a primary source, dated 2026-06-16.

June 15, 2026
in AI
14 min read

Beyond the stack trace: observability and verification for async AI agents in production

Your agent traces are beautiful and they still won't save you at 2 AM. You can see every LLM call, every token, every tool invocation in Langfuse — and you still can't prove what the agent decided, why, or whether it was allowed to. Observability tells you what happened; it doesn't prove it happened the way it was supposed to.

That gap is the new runtime problem, and the industry noticed this week: Diagrid shipped Verifiable Execution in Dapr 1.18 on June 11, and The New Stack ran a piece arguing agent verification is a runtime concern, not a test-time one.¹²