Blog

June 17, 2026
in AI, Kubernetes, Architect
9 min read

An On-Premise RAG Reference Architecture for 100 Users: Right-Sized GPUs, HA by Design

You can build a production RAG platform on your own hardware for the price of two years of API bills, and it fits in half a rack. The trick is sizing the GPUs to the models instead of buying the biggest card on the truck. This is the full design for a 100-user, highly available, hybrid-plus-graph RAG stack, with every GPU choice backed by the VRAM math.

June 17, 2026
in Kubernetes, Data
14 min read

Stateful AI workloads on Kubernetes

Every RAG team hits the same wall six months in. They started with PostgreSQL and pgvector because it was the path of least resistance, the embeddings worked, the retrieval was fine, and nobody had to learn a new system. Then the vector count crossed some invisible threshold — usually around 10M–50M, depending on dimensionality — and recall started sliding, query latency started climbing, and the migration conversation began. The problem: migrating from pgvector to a dedicated vector database while you have live embeddings in production is brutal, and most teams either over-engineer (spinning up a Weaviate cluster for 10K vectors) or under-engineer (running pgvector into the ground at 50M).

This is the decision tree I wish someone had handed me. It's tuned for Kubernetes, June 2026, and every number is from a primary source.

June 16, 2026
in AI, Kubernetes, Architect
32 min read

GPU for LLM workloads reference

LLM workloads are not "regular workloads that happen to need more RAM." They are memory-bandwidth-bound during decode, compute-bound during prefill, and topology-bound the moment a model can't fit on one GPU. The hardware spec sheet, the VM config, the container runtime, the Kubernetes device plugin, and the multi-GPU pattern you pick are all the same decision at different layers. Get one wrong and the others stop mattering.

This is a reference for the whole stack — from the HBM3e bandwidth of a Blackwell GPU to the nvidia.com/gpu resource advertised by a DaemonSet to the tensor-parallel group size that decides whether your 70B model serves or stalls. Every number is verified against a primary source, dated 2026-06-16.
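The sizing logic above can be sketched in a few lines. This is a rough back-of-the-envelope model, not the reference's own method: it counts weights only (no KV cache, activations, or runtime overhead), and the `usable_fraction` default, the power-of-two group sizes, and the function names are all assumptions made for illustration.

```python
# Bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params_billion: float, dtype: str) -> float:
    """Weight footprint in GB (1e9 bytes): parameter count x bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[dtype]


def min_tp_size(params_billion: float, dtype: str, gpu_gb: float,
                usable_fraction: float = 0.9) -> int:
    """Smallest power-of-two GPU count whose usable memory holds the weights alone.

    usable_fraction is an assumed allowance for runtime overhead; real
    deployments also need headroom for the KV cache.
    """
    need = weight_gb(params_billion, dtype)
    per_gpu = gpu_gb * usable_fraction
    n = 1
    while n * per_gpu < need:
        n *= 2
    return n


def decode_ceiling_tok_s(params_billion: float, dtype: str,
                         bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Each generated token has to read every weight once, so throughput
    cannot exceed memory bandwidth divided by the weight footprint.
    """
    return bandwidth_gb_s / weight_gb(params_billion, dtype)


print(weight_gb(70, "fp16"))        # 140.0 GB of weights
print(min_tp_size(70, "fp16", 80))  # 2 GPUs of 80 GB, weights only
```

A 70B model at FP16 needs 140 GB for weights, so it cannot serve from a single 80 GB card no matter how the rest of the stack is configured. Passing a card's spec-sheet bandwidth to `decode_ceiling_tok_s` gives the ceiling that the decode phase cannot beat for one request stream.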

June 15, 2026
in AI
14 min read

Beyond the stack trace: observability and verification for async AI agents in production

Your agent traces are beautiful and they still won't save you at 2 AM. You can see every LLM call, every token, every tool invocation in Langfuse — and you still can't prove what the agent decided, why, or whether it was allowed to. Observability tells you what happened; it doesn't prove it happened the way it was supposed to.

That gap is the new runtime problem, and the industry noticed this week: Diagrid shipped Verifiable Execution in Dapr 1.18 on June 11, and The New Stack ran a piece arguing agent verification is a runtime concern, not a test-time one.¹²

June 15, 2026
in AI, Kubernetes, Architect
20 min read

Replacing Claude Code With a Self-Hosted LLM on Kubernetes: A Production Reference

A single Qwen3-Coder-30B-A3B instance on Kubernetes (vLLM 0.23.0, one H100 80GB, $6.88/GPU-hour on AWS) delivers "results comparable to Claude Sonnet" on agentic coding benchmarks, at roughly one-quarter to one-sixth the all-in cost of the hosted Anthropic API at 100-engineer scale.¹ Those numbers are real, and so are the trade-offs. The post rests on one bet: at scale the model is the cheap line on the invoice, and what decides whether you ship is everything around it. That means autoscaling, observability, the security review, the tool-calling schema, and network egress.

I wrote this as a reference, not a sales pitch. It covers what Claude Code costs in production, what the vLLM-on-Kubernetes stack looks like in mid-2026, what you give up when you cut the API cord, and the six things that break first.

June 14, 2026
in Kubernetes, AI
9 min read

Serving an LLM on Kubernetes in 2026: the operations checklist nobody gave you

You can helm install vllm in an afternoon. You cannot tell me your p99 TTFT at 70% GPU utilization — and that's the gap this post is about. The "deploy vLLM on Kubernetes" tutorial is saturated. The decision tree underneath it isn't written anywhere, so most clusters are still pinned to whatever they stood up 12 months ago.

The engine layer moved while you weren't looking: Tiny-vLLM (May 29), KVarN KV-cache quantization (June 4), Kvcached for elastic KV cache, Expanse reclaiming idle GPU, and vLLM's wide expert-parallel path pushing ~2.2k tok/s on H200s. (All five are bleeding-edge; verify the names, dates, and numbers before you pin them.) Meanwhile your Helm chart hasn't changed since last summer.

June 12, 2026
in AI, Cloud, Kubernetes
9 min read

Agents in production need a gateway, not a wrapper: the MCP + policy + observability stack for cloud-native AI

If your "AI agent" is just an LLM with a requests library and a service account, you don't have an agent — you have a future incident. Every team that has shipped agents to production learned the same lesson: the LLM is the easy 10%. The hard part is everything around it.

I'm going to walk through the stack I now consider non-negotiable: MCP for tools, a gateway for policy and routing, and OpenTelemetry for observability. I'll show configs, not concepts.

June 12, 2026
in AI, Cloud
11 min read

The Cloud Engineer's Guide to AI Agents That Actually Do Things

There's a chasm right now between two worlds. AI builders ship agents that demo beautifully — "look, it booked my flight!" — but hand them real cloud access and they spin up 200 EC2 instances, leak IAM keys into logs, or get prompt-injected into deleting a production S3 bucket. Cloud engineers know how to build systems that don't fall over, but treat "AI agent" as a magic box: block it entirely, or hand it root keys and hope.

The middle ground is empty — and that's where the value is. This is a guide to building agents that mutate cloud infrastructure without burning it down: the architecture, the failure modes, and a concrete build you can copy.

June 12, 2026
in AI, Kubernetes
11 min read

A Minimal Open-Source AI Platform: Laptop First, Kubernetes Later

You don't need a GPU cluster to build a real AI platform. Seven containers and one docker compose up give you an OpenAI-compatible gateway, a chat UI, local models, and full observability — on your laptop. And because the API contract never changes, the same architecture maps one-to-one onto production Kubernetes.

June 12, 2026
in AI, Kubernetes
8 min read

I Ran My Own LLM on Kubernetes for 30 Days: Here's the Real Cost

The cloud AI cost conversation is two camps shouting past each other: vendor blogs saying "our managed service is cheaper" and Reddit threads saying "self-hosting saves 90%." Neither shows real bills, so neither is useful.

This post shows the real numbers: 30 days, real production traffic, four configurations, every line item. Spoiler — self-hosting was 31% cheaper on the cloud bill and 33% more expensive once you count engineering time. Both facts are true. The details are what matter.
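To see how both percentages can hold at once, here is a minimal arithmetic sketch. The dollar amounts are hypothetical placeholders chosen only to reproduce the two percentages; they are not the post's line items, and the sketch assumes the managed baseline carries no extra engineering time.

```python
def pct_vs_baseline(cost: float, baseline: float) -> float:
    """Signed percentage difference of cost relative to baseline."""
    return (cost - baseline) / baseline * 100.0


# Hypothetical monthly figures, for illustration only.
managed_bill = 10_000.0      # managed service invoice
self_hosted_bill = 6_900.0   # cloud invoice for the self-hosted setup
engineering_time = 6_400.0   # cost of engineer hours spent running it

cloud_only = pct_vs_baseline(self_hosted_bill, managed_bill)
all_in = pct_vs_baseline(self_hosted_bill + engineering_time, managed_bill)

print(f"cloud bill only: {cloud_only:+.0f}%")  # negative means cheaper
print(f"all-in:          {all_in:+.0f}%")      # positive means more expensive
```

The sign flips as soon as engineering time exceeds the gap between the two invoices, which is why the comparison depends on which line items are counted.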