AI

June 15, 2026
in AI, Kubernetes, Architect
20 min read

Replacing Claude Code With a Self-Hosted LLM on Kubernetes: A Production Reference

A single Qwen3-Coder-30B-A3B instance on Kubernetes (vLLM 0.23.0, one H100 80GB, $6.88/GPU-hour on AWS) achieves "results comparable to Claude Sonnet" on agentic coding benchmarks, at roughly one-quarter to one-sixth the all-in cost of the hosted Anthropic API at 100-engineer scale.¹ Those numbers are real, and so are the trade-offs. The post rests on one bet: at scale the model is the cheap line on the invoice, and everything around it decides whether you ship. That means autoscaling, observability, the security review, the tool-calling schema, and network egress.

I wrote this as a reference, not a sales pitch. It covers what Claude Code costs in production, what the vLLM-on-Kubernetes stack looks like in mid-2026, what you give up when you cut the API cord, and the six things that break first.
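
To put the $6.88/GPU-hour figure in monthly terms, here is a rough sketch of the GPU line item alone. It assumes the instance is always on, a 30-day month, and that the one GPU is shared by all 100 engineers. Those are my simplifications for illustration, not the post's cost model, and the result is not the all-in cost the comparison refers to.

```python
# GPU-only arithmetic; excludes storage, egress, and engineering time.
GPU_HOURLY_USD = 6.88        # from the post: one H100 80GB on AWS
HOURS_PER_MONTH = 24 * 30    # assumption: always-on, 30-day month
ENGINEERS = 100              # from the post: 100-engineer scale

# Monthly cost of keeping the single GPU running.
gpu_monthly = GPU_HOURLY_USD * HOURS_PER_MONTH

# Assumption: the one GPU is shared evenly by every engineer.
per_engineer = gpu_monthly / ENGINEERS

print(f"GPU per month:    ${gpu_monthly:,.2f}")
print(f"GPU per engineer: ${per_engineer:,.2f}")
```

Everything else on the invoice sits on top of this number, which is the point of the bet above.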

June 14, 2026
in Kubernetes, AI
9 min read

Serving an LLM on Kubernetes in 2026: the operations checklist nobody gave you

You can helm install vllm in an afternoon. You cannot tell me your p99 TTFT at 70% GPU utilization — and that's the gap this post is about. The "deploy vLLM on Kubernetes" tutorial is saturated. The decision tree underneath it isn't written anywhere, so most clusters are still pinned to whatever they stood up 12 months ago.

The engine layer moved while you weren't looking: Tiny-vLLM (May 29), KVarN KV-cache quantization (June 4), Kvcached for elastic KV cache, Expanse reclaiming idle GPU, and vLLM's wide expert-parallel path pushing ~2.2k tok/s on H200s. (All five are bleeding-edge, so verify the names, dates, and numbers before you pin them.) Meanwhile your Helm chart hasn't changed since last summer.

June 12, 2026
in AI, Cloud, Kubernetes
9 min read

Agents in production need a gateway, not a wrapper: the MCP + policy + observability stack for cloud-native AI

If your "AI agent" is just an LLM with a requests library and a service account, you don't have an agent — you have a future incident. Every team that has shipped agents to production learned the same lesson: the LLM is the easy 10%. The hard part is everything around it.

I'm going to walk through the stack I now consider non-negotiable: MCP for tools, a gateway for policy and routing, and OpenTelemetry for observability. I'll show configs, not concepts.

June 12, 2026
in AI, Cloud
11 min read

The Cloud Engineer's Guide to AI Agents That Actually Do Things

There's a chasm right now between two worlds. AI builders ship agents that demo beautifully — "look, it booked my flight!" — but hand them real cloud access and they spin up 200 EC2 instances, leak IAM keys into logs, or get prompt-injected into deleting a production S3 bucket. Cloud engineers know how to build systems that don't fall over, but treat "AI agent" as a magic box: block it entirely, or hand it root keys and hope.

The middle ground is empty — and that's where the value is. This is a guide to building agents that mutate cloud infrastructure without burning it down: the architecture, the failure modes, and a concrete build you can copy.

June 12, 2026
in AI, Kubernetes
11 min read

A Minimal Open-Source AI Platform: Laptop First, Kubernetes Later

You don't need a GPU cluster to build a real AI platform. Seven containers and one docker compose up give you an OpenAI-compatible gateway, a chat UI, local models, and full observability — on your laptop. And because the API contract never changes, the same architecture maps one-to-one onto production Kubernetes.

June 12, 2026
in AI, Kubernetes
8 min read

I Ran My Own LLM on Kubernetes for 30 Days: Here's the Real Cost

The cloud AI cost conversation is two camps shouting past each other: vendor blogs saying "our managed service is cheaper" and Reddit threads saying "self-hosting saves 90%." Neither shows real bills, so neither is useful.

This post shows the real numbers: 30 days, real production traffic, four configurations, every line item. Spoiler — self-hosting was 31% cheaper on the cloud bill and 33% more expensive once you count engineering time. Both facts are true. The details are what matter.
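
Both percentages can hold at once because they measure different totals against the same baseline. A minimal sketch of that arithmetic, assuming the baseline is a managed-service bill. The amounts are hypothetical placeholders normalized to 100, chosen only to reproduce the two percentages; they are not figures from the 30-day run.

```python
def percent_change(baseline: float, candidate: float) -> float:
    """Return the change from baseline to candidate as a percentage."""
    return (candidate - baseline) / baseline * 100.0


# Hypothetical, normalized monthly figures (NOT the post's real bills).
managed_bill = 100.0       # assumed baseline: the managed service
self_hosted_cloud = 69.0   # self-hosted infrastructure line items only
engineering_time = 64.0    # assumed cost of the hours spent operating it

# Same baseline, two different totals for the self-hosted side.
cloud_only = percent_change(managed_bill, self_hosted_cloud)
all_in = percent_change(managed_bill, self_hosted_cloud + engineering_time)

print(f"cloud bill only:       {cloud_only:+.0f}%")
print(f"with engineering time: {all_in:+.0f}%")
```

The sign flips as soon as the engineering line is larger than the infrastructure savings, so that is the number to estimate honestly before deciding.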

June 12, 2026
in General, AI, Practices
12 min read

You can't use a USB drive as VRAM — the enterprise guide to GPU memory capacity planning in 2026

Storage isn't VRAM. eGPUs aren't a data center strategy. Shared memory isn't a capacity plan. Every quarter, an AI infrastructure team somewhere asks the same question: "we're running out of GPU memory, can we just use the SSDs?" No. Here's what actually works at scale, what doesn't, and the procurement and capacity planning playbook for GPU memory in 2026.

This post is written for AI platform engineers, infrastructure architects, and FinOps leads running shared GPU clusters. It's not about a single workstation — it's about a fleet. The unit of analysis is the rack, the budget, and the quarter.

Once the fleet exists, two companion posts cover what runs on it: engine selection and quantization (which engine, which precision, when to call an API instead) and GPU Autoscaling is Broken (scaling LLM inference under real load). This one is the layer above both: how much GPU memory to buy in the first place.

June 12, 2026
in Development, Architect, AI
10 min read

The backend-for-frontend pattern for AI apps: thin client, fat API, one job

Most AI apps in 2026 still ship as "browser → OpenAI" with an API key in the frontend bundle. This is broken in every dimension that matters: security, cost, observability, provider lock-in, and rate limiting. The fix is a Backend-for-Frontend (BFF) — a thin server between the browser and the LLM provider that owns auth, cost, streaming, and provider abstraction. The browser does UI. The BFF does everything else.

This is the pattern every serious AI product has converged on by 2026. Here's what it looks like, why each piece exists, and how to build it without over-engineering.

June 12, 2026
in Development, AI, Architect
11 min read

Streaming responses in 2026: SSE vs WebSockets vs gRPC for AI apps

Every AI app needs to stream. Most teams pick the wrong transport and spend a week debugging why the first token takes 8 seconds. The LLM is fast — the network in front of it isn't. This post compares the three transports that actually work in production (SSE, WebSockets, gRPC streaming), shows the buffering traps that kill streaming, and gives you the decision framework to pick the right one in 30 seconds.

If you've ever shipped a "ChatGPT-style" streaming endpoint and watched it work in dev, then break in staging behind nginx, this is the post for you.
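
For orientation, here is a minimal sketch of what an SSE stream looks like on the wire, plus response headers commonly set to discourage proxy buffering. The `{"token": ...}` payload shape and the `[DONE]` end marker are conventions I'm assuming for illustration; `X-Accel-Buffering: no` is the header nginx reads to disable buffering for a single response.

```python
import json
from typing import Iterable, Iterator

# Headers commonly sent with an SSE response.
SSE_HEADERS = {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    "X-Accel-Buffering": "no",  # asks nginx not to buffer this response
}


def sse_frame(payload: dict) -> str:
    """Encode one payload as an SSE event: a data line plus a blank line."""
    # json.dumps escapes newlines, so the payload stays on one data line.
    return f"data: {json.dumps(payload)}\n\n"


def stream_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE frame per token, then an end-of-stream marker."""
    for token in tokens:
        yield sse_frame({"token": token})  # hypothetical payload shape
    yield "data: [DONE]\n\n"  # assumed sentinel, not part of the SSE spec


frames = list(stream_tokens(["Hel", "lo"]))
for frame in frames:
    print(repr(frame))
```

Each frame is only useful if it reaches the browser as soon as it is yielded, which is why any buffering layer between the server and the client shows up as first-token latency.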

June 12, 2026
in Kubernetes, AI
10 min read

GPU Autoscaling is Broken: What I Learned Scaling LLM Inference to 10K QPS

Standard Kubernetes autoscaling assumes more load = more pods = more capacity. With stateless REST APIs, that works. With LLM inference, it falls apart — and it took us three months of production pain at 10K QPS to figure out why.

This post covers the four patterns that actually worked, the exact configs we run, and the numbers before and after: p99 latency from 12s down to 1.8s, OOM kills from 3% to under 0.1%, GPU utilization from 40% to 75%.

Two companions go alongside this one: engine selection and quantization (which engine and precision to run before you scale anything) and the GPU memory capacity-planning guide (how much GPU to buy in the first place).